The Problem

Primary Education Completion Rate

“When it’s shown as an average number of years in school and levels of achievement, the developing world is about 100 years behind developed countries.”

(Winthrop and McGivney 2015 - Brookings Report on Global Educational Inequality)

Despite immense efforts to bolster educational attainment worldwide, the world is facing a learning crisis, particularly in the Global South. As a result, development efforts are moving towards more inclusive educational practices. Yet there is a lack of accessible data for understanding how education varies sub-nationally, in a way that accounts for the histories of repression and political exclusion that have barred certain groups from equal access to education. For example, while governments and policy-makers debate how education can be inclusive of diverse languages and ethnic groups, no readily available data accounts for ethnic groups’ educational attainment across time and space.

Much of the data on global education rates is aggregated at the country level, leaving no opportunity to account for sub-national variation. Country-level data may not be representative of an entire population. For example, while Morocco may have an 87.8% primary school completion rate, that does not mean this completion rate is evenly (or randomly) distributed across the population. We should expect certain groups with histories of political marginalization to have lower education rates; however, little data currently allows for demonstrating this cross-nationally. In other words, while the above quote may offer a shocking comparison between the Global South and North, it may actually understate the extent of educational inequality for groups that have been historically marginalized within countries.

(For a follow-up on what the primary education completion rate looks like in Morocco at the sub-national level, see the “Example Figures & Analyses” section.)

The Solution: Ethnic Group Education (EGE) Dataset

The following is an R markdown document that highlights the creation of the Ethnic Group Education dataset (EGE). The EGE is a dataset that will ultimately include all major ethnic groups per country-year (1969-2015) and their educational attainment, as no such dataset/information readily exists.

As a first cut at such a dataset and a proof of concept, I first construct the EGE for 35 countries in Africa.

In short, the dataset is constructed by:

Taking Afrobarometer Survey waves 4-6 and merging them into one dataset.
Using the Linking Ethnic Data from Africa (LEDA) package in R to link respondents’ languages to ethnic groups across Afrobarometer waves. Specifically, I link individuals to ethnic groups as listed in the Ethnic Power Relations (EPR) dataset, which includes country-year information on ethnic groups and their relative political status (Monopoly, Dominant, Senior Partner, Junior Partner, Powerless, Discriminated, Irrelevant).
Using individual levels of education from 2008-2016 to aggregate and backtrack educational attainment averages per ethnic group per year.

The end result is a dataset that contains country-year information on every major ethnic group in 35 African countries and their corresponding:

Average educational attainment from 1969-2015
The corresponding ethnic group information (i.e. is the ethnic group excluded from power, dominant, etc.)
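
The third construction step (backtracking attainment from surveys fielded in 2008-2016) is not spelled out in this section. As a rough illustration only, the sketch below assumes that primary schooling is finished by age 12, so that a respondent interviewed later can stand in for their ethnic group in any earlier year after that point. The toy data, the `backcast()` helper, and the age-12 cutoff are all hypothetical and are not taken from the EGE code.

```r
# Toy respondents (made-up values, not Afrobarometer data)
resp <- data.frame(
  group   = c("A", "A", "A", "B", "B", "B"),
  year    = c(2008, 2008, 2012, 2008, 2012, 2012),  # survey year
  age     = c(60, 30, 25, 55, 40, 20),
  primary = c(0, 1, 1, 0, 0, 1)                     # 1 = completed primary school
)

# ASSUMPTION: primary school is finished by age 12
resp$cohort_year <- resp$year - resp$age + 12

# For each target year, average over respondents who were already past
# primary-school age in that year
backcast <- function(d, target_years) {
  out <- lapply(target_years, function(t) {
    eligible <- d[d$cohort_year <= t, ]
    if (nrow(eligible) == 0) return(NULL)
    agg <- aggregate(primary ~ group, data = eligible, FUN = mean)
    agg$year <- t
    agg
  })
  do.call(rbind, out)
}

ege_toy <- backcast(resp, c(1970, 1990, 2010))
print(ege_toy)
```

A design like this trades precision for coverage: early years rest on the oldest (and fewest) respondents, so estimates for the 1970s would be much noisier than those for the 2000s.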

Relevance & Need of EGE

Policy-makers and political scientists studying authoritarian regime maintenance illustrate how education invites both risk and reward for non-democratic states. Education increases pro-democratic attitudes, political dis-engagement, and ultimately autocratic failure. At the same time, political elites in authoritarian countries are predicted to be hesitant towards investing in disenfranchised populations. However, education has also been found to bolster national loyalty, human capital, and long-term development. Nor is the real-world variation clear: non-democratic countries display significant variation in educational investment and attainment, in addition to varied relationships between education and political participation. The question, “When do policy makers and political elites in non-democratic states support meaningful education efforts?” remains contested at best.

My research begins to answer this question by investigating two factors: the ethnic diversity of the country and the extent to which the government uses propaganda in schools. Education does not have a uniform effect. Education will not instill similarly pro-democratic attitudes across a diverse population - even if the education “treatment” is constant. In other words, similar educational policies and initiatives across non-democratic countries can have opposite outcomes - jeopardizing or strengthening political stability. Similarly, at the individual level, increased education can lead to individuals becoming supportive of or opposed to their governments.

My research currently focuses on three inter-related questions:

When does education strengthen or weaken national identity?
When does education lead to autocratic stability or democratization?
Do inclusive educational policies that recognize previously marginalized cultures/languages foster a shared or divisive national identity?

The following data construction effort allows us to begin answering each of these questions.

Code & Creation of EGE

The following highlights the construction of the dataset, and then provides some preliminary figures/information using the dataset.

First, I load in all necessary packages and three waves of the “Afrobarometer” surveys. Afrobarometer is a regional survey initiative that asks over 2000 respondents per country in Africa identical questions, and is similar to many other regional survey efforts (Arab Barometer, Latinobarómetro, Eurobarometer, etc.). Using Afrobarometer allows for a proof of concept that could expand into other regions and ultimately result in a fully global dataset.

# load in all packages for rest of code
library(countrycode)
library(ggplot2)
library(foreign)
library(directlabels)
library(tidyr)
library(dplyr)
library(reshape2)
library(stargazer)
library(multiwayvcov)
library(miceadds)
library(jtools)
library(readxl)
library(plyr) # provides join(); attached after dplyr, so it masks dplyr's summarise(), mutate(), rename(), etc.
library(haven)
library(stringr)
library(LEDA)
library(gridExtra)
library(sjlabelled)
library(knitr)
library(DT)

# working directories
setwd("~/Google Drive/Ohio State/Dissertation/Ethnic Group Dataset") #macbook

# Load in Afrobarometer surveys
ab4 <- read_sav("merged_r4_data.sav")
ab5 <- read_sav("merged-round-5-data-34-countries-2011-2013-last-update-july-2015.sav")
ab6 <- read_sav("merged_r6_data_2016_36countries2.sav")

Merging Afrobarometer Datasets

Each “wave” of Afrobarometer occurred at a different time and with different respondents.

Wave 4: 20 countries in 2008
Wave 5: 34 countries in 2011-2013
Wave 6: 36 countries in 2016

By combining these three waves we obtain a significantly larger sample of respondents from which to calculate average educational levels across time, improving the accuracy of, and our confidence in, our estimates.

Afrobarometer Round 4

First, Afrobarometer numbers its countries differently each round. Therefore, I add the country names to the data so that I can attach Correlates of War (COW) country codes. For each round, I created a corresponding Excel document that lists the country name and Afrobarometer value. I can then use the countrycode package to standardize the values.

ab.ccodes.r4 <- read_excel("ab_country_codes_r4.xlsx")
#head(ab.ccodes.r4) ## Note: these are the country values *as defined in Afrobarometer Wave 4*
# countrycode() takes each country name and returns its Correlates of War (COW) country code
ab.ccodes.r4$ccode <- countrycode(ab.ccodes.r4$Statename, "country.name", "cown")
ab4 <- join(ab.ccodes.r4, ab4, by ="COUNTRY")

The datasets are large and we need to clean our variables of interest. We’ll keep the following demographic information from the Afrobarometer Wave 4:

age (Q1)
education (Q89)
language (Q3)
country (COUNTRY)
survey year (DATEINTR)
ethnic or national identity (Q83)
sex (THISINT, recoded as female)
employment (Q94)
urban/rural (URBRUR)

For demonstration purposes later on, we also keep the following public opinion information for each respondent.

views on democracy (general)
extent of democracy in [Respondent Country]
satisfaction w/ democracy in [Respondent country]
Trust in President/Prime Minister
Trust in Parliament
Trust in Ruling Party
Perception of ethnic group treatment by government

The following code highlights how each of the above variables is re-coded for ease of analysis and interpretation for Wave 4.

## Age
# Question Number: Q1
# Question: How old are you?
#    Variable Label: Q1. Age
# Values: 18-110, 998-999, -1
# Value Labels: 998=Refused to answer, 999=Don't know, -1=Missing 

ab4$age <- ab4$Q1
ab4$age <- as.numeric(as.character(ab4$age))
ab4$age[ab4$age == -1] <- NA
ab4$age[ab4$age == 998] <- NA
ab4$age[ab4$age == 999] <- NA
#table(ab4$age)

## Education
# Question Number: Q89
# Question: What is the highest level of education you have completed?
# Variable Label: Education of respondent
# Values: 0-9, 99, 998, -1
# Value Labels: 0=No formal schooling, 1=Informal schooling only (including Koranic schooling), 2=Some primary schooling, 3=Primary school completed, 4=Some secondary school/ high school, 5=Secondary school completed/high school completed, 6=Post-secondary qualifications, other than university e.g. a diploma or degree from polytechnic or college, 7=Some university, 8=University completed, 9=Post-graduate, 99=Don’t know, 998=Refused to answer, -1=Missing data

ab4$edu <- as.numeric(ab4$Q89)
ab4$edu[ab4$edu == -1] <- NA
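ab4$edu[ab4$edu == 99] <- NA
ab4$edu[ab4$edu == 998] <- NA # refusals are missing too; otherwise 998 would pass the >= thresholds below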
#table(ab4$edu)

ab4$primary <- ifelse(ab4$edu >= 3, 1, 0)
ab4$secondary <- ifelse(ab4$edu >= 5, 1, 0)
ab4$tertiary <- ifelse(ab4$edu >= 8, 1, 0)

## Language
# Question Number: Q3
# Question: Which [country] language is your home language?
# Variable Label: Language of Respondent
# Values: See codebook
# Value Labels: See codebook

ab4$language <- ab4$Q3

## Survey Year
# Question: Date of interview
# Variable Label: Date of interview
# Values: 04.03.08 – 31.12.08

# table(ab4$DATEINTR)
# Despite the codebook saying the values are only in 2008, the table indicates that some respondents were interviewed into 2009.
# Therefore, I'll create a new variable that takes the first 4 digits/integers of the DATEINTR variable.

ab4$year <- ab4$DATEINTR
ab4$year <- as.character(ab4$year)
ab4$year <- str_sub(ab4$year, 1, 4)
ab4$year <- as.numeric(ab4$year)
#table(ab4$year)

## Ethnic vs. National Identity
# Question Number: Q83
# Question: Let us suppose that you had to choose between being a [Ghanaian/Kenyan/etc.] and being a ________ [R’s Ethnic Group]. Which of the following best expresses your feelings?
# Variable Label: Ethnic or national identity
# Values: 1-5, 7, 9, 998, -1
# Value Labels: 1=I feel only (R’s ethnic group), 2=I feel more (R’s ethnic group) than [Ghanaian/Kenyan/etc.], 3=I feel equally [Ghanaian/Kenyan/etc.] and (R’s ethnic group), 4=I feel more [Ghanaian/Kenyan/etc.] than (R’s ethnic group), 5=I feel only [Ghanaian/Kenyan/etc.], 7=Not applicable, 9=Don’t know, 998=Refused to answer, -1=Missing data

ab4$identity <- ab4$Q83
ab4$identity[ab4$identity == -1] <- NA
ab4$identity[ab4$identity == 7] <- NA
ab4$identity[ab4$identity == 9] <- NA
ab4$identity[ab4$identity == 998] <- NA # refused to answer

# Urban vs. Rural
## Question Number: URBRUR
## Question: PSU/EA
## Variable Label: Urban or Rural Primary Sampling Unit
## Values: 1-2
## Value Labels: 1=urban, 2=rural
## Note: Answered by interviewer

ab4$rural <- ab4$URBRUR
ab4$rural <- ab4$rural - 1
#1: rural, 0: urban

# Sex
# Question Number: THISINT
# Question: This interview must be with a:
# Variable Label: This interview, gender
# Values: 1, 2
# Value Labels: 1=Male, 2=Female
# Note: Answered by interviewer

ab4$female <- ab4$THISINT
ab4$female <- ab4$female-1
#1: female, 0 : male

# Employed
# Question Number: Q94
# Question: Do you have a job that pays a cash income? Is it full-time or part-time? And are you presently looking for a job (even if you are presently working)?
# Variable Label: Employment status
# Values: 0-5, 9, 998, -1
# Value Labels: 0=No (not looking), 1=No (looking), 2=Yes, part time (not looking), 3=Yes, part time (looking), 4=Yes, full time (not looking), 5=Yes, full time (looking), 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: SAB

ab4$employment <- ab4$Q94
ab4$employment[ab4$employment == -1] <- NA
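ab4$employment[ab4$employment == 9] <- NA
ab4$employment[ab4$employment == 998] <- NA # refused to answer; otherwise 998 would be coded as employed below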

ab4$employed <- ifelse(ab4$employment > 1, 1, 0)

# View on Democracy
# Question Number: Q30
# Question: Which of these three statements is closest to your own opinion?
# Statement 1: Democracy is preferable to any other kind of government.
# Statement 2: In some circumstances, a non-democratic government can be preferable.
# Statement 3: For someone like me, it doesn’t matter what kind of government we have.
# Variable Label: Support for democracy
# Values: 1-3, 9, 998, -1
# Value Labels: 1=Statement 3: Doesn’t matter, 2=Statement 2: Sometimes non-democratic preferable, 3=Statement 1: Democracy preferable, 9=Don’t know, 998=Refused to answer, -1=Missing data

#table(ab4$Q30)
ab4$democracy <- ab4$Q30
ab4$democracy[ab4$democracy == -1] <- NA
ab4$democracy[ab4$democracy == 9] <- NA
ab4$democracy[ab4$democracy == 998] <- NA # refused to answer
#table(ab4$democracy)

# Extent of Democracy in [Country]
# Question Number: Q42A
# Question: In your opinion how much of a democracy is [Ghana/Kenya/etc.] today?
# Variable Label: Extent of democracy
# Values: 1-4, 8, 9, 998, -1
# Value Labels: 1=Not a democracy, 2=A democracy, with major problems, 3=A democracy, but with minor problems, 4=A full democracy, 8=Do not understand question/ do not understand what ‘democracy’ is, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: Ghana 97

# table(ab4$Q42A)
ab4$democracyInCountry <- ab4$Q42A
ab4$democracyInCountry[ab4$democracyInCountry == -1] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 8] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 9] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 998] <- NA # refused to answer
# table(ab4$democracyInCountry)

# Satisfied w/ Democracy in [Country]
# Question Number: Q43
# Question: Overall, how satisfied are you with the way democracy works in [Ghana/Kenya/etc.]? Are you:
# Variable Label: Satisfaction with democracy
# Values: 0-4, 9, 998, -1
# Value Labels: 0=My country is not a democracy, 1=Not at all satisfied, 2=Not very satisfied, 3=Fairly satisfied, 4=Very satisfied, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: Eurobarometer

#table(ab4$Q43)
ab4$satisfiedDemInCountry <- ab4$Q43
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == -1] <- NA
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == 9] <- NA
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == 998] <- NA # refused to answer
#table(ab4$satisfiedDemInCountry)

# Trust in President
# Question Number: Q49A
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: The President?
# Variable Label: Trust president
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Zambia96
# Note: “Prime Minister” in Lesotho; “President” and “Prime Minister” in Burkina Faso, Cape Verde, Madagascar, Mali, Mozambique, Namibia, Senegal and Zimbabwe; “President” in Benin, Botswana, Ghana, Kenya, Liberia, Malawi, Nigeria, South Africa, Tanzania, Uganda, and Zambia.

ab4$trustPresident <- ab4$Q49A
ab4$trustPresident[ab4$trustPresident == -1] <- NA
ab4$trustPresident[ab4$trustPresident == 9] <- NA
ab4$trustPresident[ab4$trustPresident == 998] <- NA # refused to answer
# table(ab4$trustPresident)

# Trust in Parliament
# Question Number: Q49B
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: Parliament?
# Variable Label: Trust parliament/national assembly
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Adapted from Zambia96
# Note: “National Assembly” in Benin, Burkina Faso, Cape Verde, Liberia, Madagascar, Malawi, Mali, Mozambique, Nigeria, Tanzania, Uganda, Zambia; “Parliament” in Botswana, Ghana, Kenya, Lesotho, Namibia, Senegal, and South Africa; “House of Assembly” in Zimbabwe.

#table(ab4$Q49B)
ab4$trustParliament <- ab4$Q49B
ab4$trustParliament[ab4$trustParliament == -1] <- NA
ab4$trustParliament[ab4$trustParliament == 9] <- NA
ab4$trustParliament[ab4$trustParliament == 998] <- NA # refused to answer
#table(ab4$trustParliament)

# Trust in Ruling Party
# Question Number: Q49E
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: The Ruling Party?
# Variable Label: Trust the ruling party
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Adapted from Zambia96

#table(ab4$Q49E)
ab4$trustRP <- ab4$Q49E
ab4$trustRP[ab4$trustRP == -1] <- NA
ab4$trustRP[ab4$trustRP == 9] <- NA
ab4$trustRP[ab4$trustRP == 998] <- NA # refused to answer
#table(ab4$trustRP)

# Trust Traditional Leaders
# Question Number: Q49I
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: Traditional leaders
# Variable Label: Trust traditional leaders
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Zambia 96

#table(ab4$Q49I)
#ab4$trustTL <- ab4$Q49I
#ab4$trustTL[ab4$trustTL == -1] <- NA
#ab4$trustTL[ab4$trustTL == 7] <- NA
#ab4$trustTL[ab4$trustTL == 9] <- NA
#table(ab4$trustTL)

# Ethnic Group Treated Unfairly
# Question Number: Q82
# Question: How often are ___________s [R’s Ethnic Group] treated unfairly by the government?
# Variable Label: Ethnic group treated unfairly
# Values: 0-3, 7, 9, 998, -1
# Value Labels: 0=Never, 1=Sometimes, 2=Often, 3=Always, 7=Not applicable, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: SAB
# Note: Interviewer probed for strength of opinion. If respondent did not identify any group on this question – that is, if they “Refused to answer” (998), said “Don’t know” (999), or “Ghanaian only” (990) – then the interviewer marked “Not applicable” for questions 80-83 and continued to question 84.

#table(ab4$Q82)
ab4$treatedUnfairly <- ab4$Q82
ab4$treatedUnfairly[ab4$treatedUnfairly == -1] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 7] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 9] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 998] <- NA # refused to answer
#table(ab4$treatedUnfairly)
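
The recoding above follows one pattern: copy the survey item, then set its special codes to NA. A minimal sketch of that pattern as a reusable helper is shown below. `recode_missing()` is a hypothetical name, not a function from any package loaded here, and the input vector is made up. The codes have to be supplied per variable, because a value such as 9 is a valid education level (Post-graduate) but means “Don’t know” for most opinion items.

```r
# Hypothetical helper: set a variable's special codes to NA
recode_missing <- function(x, codes) {
  x <- as.numeric(x)       # drop any labelled/factor class first
  x[x %in% codes] <- NA
  x
}

# Made-up Q89 responses, including Don't know (99), Refused (998), Missing (-1)
q89 <- c(0, 3, 5, 9, 99, 998, -1)

edu     <- recode_missing(q89, codes = c(-1, 99, 998))
primary <- ifelse(edu >= 3, 1, 0)  # NA stays NA instead of counting as completed

print(data.frame(q89, edu, primary))
```

Leaving 998 in place would silently inflate the completion indicators, since `998 >= 3` is true.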

Now that all the variables are re-coded and cleaned, we can subset our variables of interest to make our datasets a bit smaller. We can then export the result as a “small” variant of the data for use later, if need be. Below are the first 20 observations of the new condensed Afrobarometer Wave 4 dataset.

# List all variables we want to keep.
myvars <- c("COUNTRY", "Statename", "ccode", "RESPNO", "age",
            "edu", "primary", "secondary", "tertiary", "language", "year", "identity", "rural", "female", "employment", "employed", "democracy",
            "democracyInCountry", "satisfiedDemInCountry", "trustPresident", "trustParliament", "trustRP", "treatedUnfairly")

# This subsets the afrobarometer dataset to only include the above variables.
ab4 <- ab4[myvars]


save(ab4, file = "ab4_small.Rda")


datatable(ab4[1:20,])

Repeat for Afrobarometer Round 5 & 6

Now I want to do the same for Round 5 and Round 6 of Afrobarometer. I do not replicate the code below, but it is otherwise identical in execution to the code for Round 4, except that certain variables are identified by different numbers in each wave.

Merging Waves 4-6

Once each wave is cleaned, re-coded, and condensed, the three datasets are identical in structure: each has the following variables:

COUNTRY
Statename
ccode
RESPNO
age
edu (+ primary, secondary, tertiary)
rural
female
employment (status)
employed (binary)
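democracy
democracyInCountry (extent of democracy in country)
satisfiedDemInCountry (satisfaction with democracy in country)
trustPresident
trustParliament
trustRP (trust in ruling party)
treatedUnfairly (whether ethnic group is treated unfairly)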
language
year
identity

Given that the columns are also in the same order, I could rbind() them; however, the respondent ID variable (“RESPNO”) would then contain duplicates, as the IDs are just a series of numbers within each wave. That is, Respondent #4 in Wave 4 is a different person than Respondent #4 in Wave 5. Therefore, the first thing I do is append the survey year to each RESPNO.

ab4$RESPNO <- as.character(ab4$RESPNO)
ab4$RESPNO <- str_c(ab4$RESPNO, "-", ab4$year)

ab5$RESPNO <- as.character(ab5$RESPNO)
ab5$RESPNO <- str_c(ab5$RESPNO, "-", ab5$year)

ab6$RESPNO <- as.character(ab6$RESPNO)
ab6$RESPNO <- str_c(ab6$RESPNO, "-", ab6$year)

head(ab4$RESPNO) # Example

## [1] "BEN0001-2008" "BEN0002-2008" "BEN0003-2008" "BEN0004-2008" "BEN0005-2008"
## [6] "BEN0006-2008"
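
Before stacking, it can be worth confirming that the new IDs really are unique. The sketch below uses two made-up waves (not Afrobarometer data) and base R's `paste()` in place of `str_c()`. The `wave` column is a hypothetical extra that is not part of the variable list above; it is shown only because it keeps the wave identifiable when a round spans several calendar years.

```r
# Two made-up waves that reuse the same respondent numbers
w4 <- data.frame(RESPNO = c("BEN0001", "BEN0002"), year = c(2008, 2008), edu = c(3, 5))
w5 <- data.frame(RESPNO = c("BEN0001", "BEN0002"), year = c(2011, 2012), edu = c(2, 8))

# Hypothetical extra column recording the source wave
w4$wave <- 4
w5$wave <- 5

# Append the survey year to the ID, then stack the waves
w4$RESPNO <- paste(w4$RESPNO, w4$year, sep = "-")
w5$RESPNO <- paste(w5$RESPNO, w5$year, sep = "-")
stacked <- rbind(w4, w5)

# anyDuplicated() returns 0 when every ID is unique
n_dup <- anyDuplicated(stacked$RESPNO)
print(n_dup)
```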

Linking Ethnic Data with EPR

Now, I want to use the Linking Ethnic Data from Africa (LEDA) package to treat the language of each respondent as an indicator of ethnicity, which I can then link to other datasets (such as EPR). The Ethnic Power Relations dataset includes country-year information on ethnic groups and their relative political status (Monopoly, Dominant, Senior Partner, Junior Partner, Powerless, Discriminated, Irrelevant).

The LEDA package lets me produce a dataset that includes the language name from Afrobarometer and its corresponding ethnic group from EPR. Here’s an example.

leda <- LEDA$new()

# Retrieve dataset dictionary
list.dict <- leda$get_list_dict()


## Link all Afrobarometer groups to EPR data for round 4
setlink.ab4 <- leda$link_set(lists.a = list(type = c("Afrobarometer"),
                                        round = 4, marker = "language"),
                         lists.b = list(type = c("EPR")),
                         link.level = "dialect",
                         by.country = T,
                         drop.a.threshold = 0,
                         drop.b.threshold = 0,
                         drop.ethno.id = F)

#Subsetting year
setlink.ab4 <- setlink.ab4[which(setlink.ab4$b.year==2008),]


# Now for Round 5
setlink.ab5 <- leda$link_set(lists.a = list(type = c("Afrobarometer"),
                                        round = 5, marker = "language"),
                         lists.b = list(type = c("EPR")),
                         link.level = "dialect",
                         by.country = T,
                         drop.a.threshold = 0,
                         drop.b.threshold = 0,
                         drop.ethno.id = F)

#Subsetting year
setlink.ab5<- setlink.ab5[which(setlink.ab5$b.year==2011),]


## Now for Round 6
setlink.ab6 <- leda$link_set(lists.a = list(type = c("Afrobarometer"),
                                        round = 6, marker = "language"),
                         lists.b = list(type = c("EPR")),
                         link.level = "dialect",
                         by.country = T,
                         drop.a.threshold = 0,
                         drop.b.threshold = 0,
                         drop.ethno.id = F)

#Subsetting year
setlink.ab6 <- setlink.ab6[which(setlink.ab6$b.year==2015),]

## Have a look
head(setlink.ab4[, c("a.group", "b.group", "a.type", "b.type")])

##     a.group                              b.group        a.type b.type
## 39     Adja                  Southwestern (Adja) Afrobarometer    EPR
## 97     Adja                  Southwestern (Adja) Afrobarometer    EPR
## 166    Adja Southeastern (Yoruba/Nagot and Goun) Afrobarometer    EPR
## 213    Adja                  Southwestern (Adja) Afrobarometer    EPR
## 282    Adja Southeastern (Yoruba/Nagot and Goun) Afrobarometer    EPR
## 329    Adja                  Southwestern (Adja) Afrobarometer    EPR

Next, I need to load in the corresponding “Language ID Number” and “Language Name” from Afrobarometer. The following Excel documents were created using the code-books from Afrobarometer. In short, I copied and pasted the delimited ID=language list from each code-book and used Excel to split it automatically into individual rows/columns.

lang.r4 <- read_excel("languages_r4.xlsx")
lang.r5 <- read_excel("languages_r5.xlsx")
lang.r6 <- read_excel("languages_r6.xlsx")

# Each of these documents contains a "language" column, which corresponds to the language ID numbers from the Afrobarometer language question.
# Each also contains a "languageName" column, which corresponds to the name of the language for each ID in the Afrobarometer codebook.

Let’s run a few checks to see how closely the language names from the code-books match those in LEDA. Below, I have R output the LEDA names that do not appear in the code-book list for each wave.

link.ab4 <- setlink.ab4[, c("a.cowcode", "a.iso3c", "a.group", "b.group", "a.type", "b.type")]
link.ab4 <- link.ab4[!duplicated(link.ab4), ]
setdiff(link.ab4$a.group, lang.r4$languageName)

##  [1] "Fuls"                                    
##  [2] "Moore"                                   
##  [3] "Senoufo"                                 
##  [4] "Arabe"                                   
##  [5] "Khassonke"                               
##  [6] "Malinke"                                 
##  [7] "Soninke/ Sarakoll"                       
##  [8] "Sonrhai"                                 
##  [9] "Mang'anja"                               
## [10] "Oshiwambo"                               
## [11] "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe"

link.ab5 <- setlink.ab5[, c("a.cowcode", "a.iso3c", "a.group", "b.group", "a.type", "b.type")]
link.ab5 <- link.ab5[!duplicated(link.ab5), ]
setdiff(link.ab5$a.group, lang.r5$languageName)

##  [1] "Moore"                               "Senoufo"                            
##  [3] "Baoule"                              "Bete"                               
##  [5] "Godie"                               "Guere"                              
##  [7] "Diakanke"                            "konianke"                           
##  [9] "Maasai / Samburu"                    "Meru / Embu"                        
## [11] "\"Official\" Malagasy"               "Khassonke"                          
## [13] "Malinke"                             "Peulh / Fulfude"                    
## [15] "Soninke / Sarakolle"                 "Sonrhai"                            
## [17] "Chimang'anja"                        "Oshiwambo (Oshindonga/Oshikwanyama)"
## [19] "Beri beri"                           "Zarrma/Songhai"                     
## [21] "Kabye"

link.ab6 <- setlink.ab6[, c("a.cowcode", "a.iso3c", "a.group", "b.group", "a.type", "b.type")]
link.ab6 <- link.ab6[!duplicated(link.ab6), ]
setdiff(link.ab6$a.group, lang.r6$languageName)

##  [1] "Moore"                               "Senoufo"                            
##  [3] "Baoule"                              "Bete"                               
##  [5] "Godie"                               "Guere"                              
##  [7] "Bangangte"                           "Foufoulde"                          
##  [9] "Mbede"                               "Myene"                              
## [11] "Nzebi/Metie"                         "Punu/Merie"                         
## [13] "Malgache << officiel >>"             "Malgache avec specificite regionale"
## [15] "Khassonke"                           "Malinke"                            
## [17] "Soninke/Sarakole"                    "Portuguese"                         
## [19] "Zarma/Songhai"

So there is some mis-match across the lists, but not much. Therefore, I fix the spelling of any languages that are obvious matches, i.e. those with just a one-letter difference (which I corroborated online to be an alternative spelling), a difference in accent marks (which LEDA does not include in any spelling), etc. I make these corrections in my Excel documents so that the language names match the spellings used in LEDA.

lang.r4 <- read_excel("languages_r4_sp_fixed.xlsx")
setdiff(link.ab4$a.group, lang.r4$languageName) # Remaining mis-matches

## [1] "Arabe"                                   
## [2] "Senufo/ Mianka"                          
## [3] "Soninke/ Sarakoll"                       
## [4] "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe"

lang.r5 <- read_excel("languages_r5_sp_fixed.xlsx")
setdiff(link.ab5$a.group, lang.r5$languageName) # Remaining mis-matches

## [1] "\"Official\" Malagasy" "Senufo"

lang.r6 <- read_excel("languages_r6_sp_fixed.xlsx")
setdiff(link.ab6$a.group, lang.r6$languageName) # Remaining mis-matches

## [1] "Senufo"

These mis-matches will remain, as there do not seem to be links available.
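
Some of the differences listed above come down to case and spacing around slashes rather than spelling. As an alternative to editing the spreadsheets by hand, those could be screened out in R first. The sketch below is only an illustration under that assumption: `normalize_name()` is a hypothetical helper and the two name vectors are made up to resemble the output above. It does not handle accent marks or genuine spelling variants, which would still need manual review.

```r
# Made-up name lists resembling the LEDA and code-book spellings
leda_names     <- c("Adja", "Soninke / Sarakolle", "konianke", "Senufo")
codebook_names <- c("Adja", "Soninke/Sarakolle", "Konianke")

# Hypothetical helper: lower-case, trim, and remove spaces around "/"
normalize_name <- function(x) gsub("\\s*/\\s*", "/", tolower(trimws(x)))

# Raw comparison: names in LEDA with no exact match in the code-book
raw_diff <- setdiff(leda_names, codebook_names)

# Comparison after normalization: only substantive mis-matches remain
remaining <- setdiff(normalize_name(leda_names), normalize_name(codebook_names))

print(raw_diff)
print(remaining)
```

Note that `setdiff()` is one-directional: it reports names in the first vector that are missing from the second, so swapping the arguments answers a different question.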

Individual Datasets

So at this point I have three datasets for each Afrobarometer round:

ab# - the Afrobarometer Dataset that contains the survey information.
lang.r# - the language number and corresponding language name from each Afrobarometer round
link.ab# - the Afrobarometer language name and corresponding EPR name for each round

Let’s quickly take a look at each.

datatable(ab4[1:10,])

datatable(lang.r4[1:10,])

datatable(link.ab4[1:10,])

colnames(link.ab4)[1] <- "ccode"
colnames(link.ab4)[2] <- "StatenameCond"
colnames(link.ab4)[3] <- "languageName"
colnames(link.ab4)[4] <- "group"
ab4 <- join(ab4, lang.r4, by = "language")

datatable(ab4[1:10,])

Now I want to merge EPR with the link.ab# list.

Loading EPR

There are several iterations of aggregating EPR. Let’s first look at the structure of the data. As you can see, it is a condensed year format, so we need to expand and subset it.

# There are several iterations of aggregating EPR.
# Let's first look at the structure of the data.
EPR <- read.csv("EPR-2014.csv")
datatable(EPR[1:20,])

# As you can see, it's a condensed year format (one row per group and from-to period), so I expand it to one row per group-year so that I can subset it.
EPR$year <- mapply(seq, EPR$from, EPR$to, SIMPLIFY=FALSE)

EPR <- EPR %>% 
    unnest(year) %>% 
    select(-from,-to)

# Subsetting year; let's do just 2008 for AB4 for now
EPR <- EPR[which(EPR$year==2008),]

EPR$ccode <- countrycode(EPR$statename, "country.name", "cown")

## Subset just countries in AB4.
#table(ab4$ccode)
ab4ccode <- as.data.frame(table(ab4$ccode))
ab4ccode <- as.vector(ab4ccode$Var1)
#head(ab4ccode) #list of country codes (COW) in Ab4

EPR <- subset(EPR, EPR$ccode %in% ab4ccode)
#table(EPR$ccode)

link.ab4 <- merge(link.ab4, EPR, by = c("ccode", "group"))
# At this point, I export the linked data (e.g. to Excel) and manually delete duplicates that don't differ in information.
# Specifically, I'm looking for duplicates of "languageName" that otherwise contain the same relevant information from EPR.

write.csv(link.ab4, file = "link_ab4.csv")
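
To make the year expansion concrete, here is a toy version of the same idea in base R. The table is made up and is not EPR data; the real code above uses `mapply()` with `tidyr::unnest()`, while this sketch repeats rows with `rep()` so that it runs without any packages.

```r
# Made-up "condensed" table: one row per group and from-to period
epr_toy <- data.frame(
  group  = c("A", "B"),
  status = c("Powerless", "Dominant"),
  from   = c(2006, 2008),
  to     = c(2008, 2009)
)

# Repeat each row once per year it covers, then attach the years
n_years  <- epr_toy$to - epr_toy$from + 1
expanded <- epr_toy[rep(seq_len(nrow(epr_toy)), n_years), c("group", "status")]
expanded$year <- unlist(Map(seq, epr_toy$from, epr_toy$to))
rownames(expanded) <- NULL

# Keep a single survey year, as done for 2008 above
epr_2008 <- expanded[expanded$year == 2008, ]

print(expanded)
print(epr_2008)
```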

At this point, given how I am linking languages using LEDA, some languages in Afrobarometer may correspond to multiple ethnic groups in the same country. In other words, one language may correspond to multiple ethnic groups, and each ethnic group may have a different status. Therefore, I adopt the following coding rules.

In the picture below, you can see that the language “Adja” (from the Afrobarometer Wave 4 language response) corresponds to two ethnic groups in Benin - groups in the Southwest and Southeast. The same applies for the language “Goun”. In this case - regardless of the language*ethnic group, their “status” does not change. In other words, all individuals who speak “Adja” (in the SW or SE) are “Junior Partners” and all individuals who speak “Goun” are “Junior Partners”. In such cases, I simply collapse the observations (or remove 1 from each, so that I don’t get duplicates when merging).

Example 1 - Excel Screencap

The next example gives us two other possibilities. In the first case (light green) the language “Akan” is linked to two ethnic groups in Ghana - the “Asante (Akan)” and the “Other Akans”. In this case, the Asante are “Senior Partner” and the Other Akans are “Junior Partner”. In such cases, I prioritize whether or not they have power (i.e., I don’t intend to differentiate between senior/junior partner). Therefore, I delete the “Other Akans” group.

Alternatively, the language “Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe” (dark green) in Nigeria applies to two ethnic groups, the Ijaw and Ogoni; however, the Ijaw are “Junior Partner” and the Ogoni are “Powerless”. Therefore, I delete both as I am unable to differentiate them in the analysis.

Example 2 - Excel Screencap

By and large, the majority of countries had no duplicates. Of those that did have duplicates, it was only 1-2 groups. The exception was Namibia - in which nearly all languages coincided with multiple ethnic groups that ultimately had different statuses. I still followed the above rules.

Example 3 - Excel Screencap
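The coding rules above were applied by hand in Excel. As a rough illustration only, they could be sketched in R as follows. The toy rows mirror the Benin, Ghana, and Nigeria examples; `resolve_links()` is a hypothetical helper, not part of the original workflow. It also assumes that when a language maps to several excluded groups, keeping the first row is acceptable, which the rules above do not specify.

```r
# Toy link table mirroring the examples described above (illustrative rows only).
link <- data.frame(
  ccode        = c(434, 434, 452, 452, 475, 475),
  languageName = c("Adja", "Adja", "Akan", "Akan",
                   "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe",
                   "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe"),
  group        = c("Southwestern (Adja)", "Southeastern (Yoruba/Nagot and Goun)",
                   "Asante (Akan)", "Other Akans", "Ijaw", "Ogoni"),
  status       = c("JUNIOR PARTNER", "JUNIOR PARTNER",
                   "SENIOR PARTNER", "JUNIOR PARTNER",
                   "JUNIOR PARTNER", "POWERLESS"),
  stringsAsFactors = FALSE
)

# Statuses treated as "included in power", ordered from most to least powerful.
included_statuses <- c("MONOPOLY", "DOMINANT", "SENIOR PARTNER", "JUNIOR PARTNER")

# Hypothetical helper: one row per country-language, following the rules above.
resolve_links <- function(link) {
  link$included <- link$status %in% included_statuses
  key <- paste(link$ccode, link$languageName, sep = "|")
  pieces <- lapply(split(link, key), function(d) {
    # Rule 3: included and excluded groups share the language -> drop all rows.
    if (length(unique(d$included)) > 1) return(d[0, ])
    # Rules 1 and 2: same inclusion status -> keep the most powerful row
    # (ties and all-excluded cases keep the first row; this is an assumption).
    d <- d[order(match(d$status, included_statuses)), ]
    d[1, ]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

fixed <- resolve_links(link)
print(fixed[, c("ccode", "languageName", "group", "status")])
```

A scripted version like this would make the manual deletions reproducible, although the hand-edited file may still differ in edge cases.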

link.ab4.fixed <- read.csv("link_ab4_fixed.csv")
link.ab4.fixed$X <- NULL
ab4 <- join(ab4, link.ab4.fixed, by = c("ccode", "languageName"))

So at this point, I’ve fully merged the Afrobarometer Wave 4 dataset with EPR: individuals are connected to ethnic groups, as defined in EPR, based upon the language or dialect they report. Let’s take a look at how many individuals in Afrobarometer Wave 4 now have corresponding information in EPR.

nrow(ab4) ### number of respondents in Afrobarometer Wave 4

## [1] 27713

sum(is.na(ab4$status)) ## number of individuals with no EPR status

## [1] 8481

nrow(ab4) - sum(is.na(ab4$status)) ## number of individuals with EPR status

## [1] 19232

(nrow(ab4) - sum(is.na(ab4$status))) / nrow(ab4) ### percent coverage of EPR and Afrobarometer Wave 4

## [1] 0.6939703

Nearly 70% of respondents in Afrobarometer Wave 4 have a corresponding EPR group. A match for the missing 30% likely exists, but I had to drop the information because some languages correspond to ethnic groups with conflicting statuses. I do hope to return to this in the future - but for now I move forward as a proof of concept.
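To see where the unmatched respondents are concentrated, the same coverage calculation can be broken down by country. The following is a minimal sketch on a made-up data frame; `ab_toy` and its values are hypothetical stand-ins for `ab4`, and only the `Statename` and `status` column names are taken from the code above.

```r
# Toy stand-in for ab4: one row per respondent; status is NA when no EPR match.
ab_toy <- data.frame(
  Statename = c("Benin", "Benin", "Ghana", "Ghana",
                "Namibia", "Namibia", "Namibia", "Namibia"),
  status    = c("JUNIOR PARTNER", NA, "SENIOR PARTNER", "SENIOR PARTNER",
                NA, NA, NA, "DOMINANT"),
  stringsAsFactors = FALSE
)

# Share of respondents with an EPR status, per country, lowest coverage first.
coverage <- sort(tapply(!is.na(ab_toy$status), ab_toy$Statename, mean))
print(round(coverage, 2))
```

Run on the real `ab4`, a table like this would show whether the missing matches are spread evenly or driven by a few countries.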

At this point, I want to repeat the above steps (beginning with loading EPR) two more times - one for each year of Afrobarometer Wave 5 (2011) and Afrobarometer Wave 6 (2015).

Repeat for AB5 and AB6

I repeat the above steps (beginning with “Loading EPR”) for AB5 and AB6, but I do not replicate it below.

As with above, there are similar problems where 1 language coincides with multiple ethnic groups. This is particularly true in North Africa, as shown below.

In the case of Morocco, Arabic coincides with two ethnic groups - Arabs and Sahrawis. Arabs are dominant, and Sahrawis are discriminated against. However, Arabic-speaking Sahrawis only make up .016 of the population. Similar issues arise with Arabic being the sole language in Sudan and Egypt.

Example 4 - Excel Screencap

Therefore, the only change I make is that if there are multiple ethnic groups to a single language, and one of the ethnic groups is less than .1 percent of the population (often much smaller) and is powerless, I delete that group in favor of the ethnic group that is much larger and has power. This is to better capture scenarios where very small marginalized ethnic groups (who likely are not even picked up by Afrobarometer surveys) speak the same language as larger empowered groups. Therefore, in the above, I delete the information from row 277 and 278.
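This additional rule could be sketched in R roughly as follows. `drop_small_excluded()` is a hypothetical helper, and the size given for Arabs is a placeholder (only the Sahrawi value of .016 comes from the text). The sketch also assumes the cutoff is applied on the same proportion scale as the EPR `size` column (0.1 here), so adjust `size_cutoff` if the intended threshold differs.

```r
# Toy rows for the Morocco example; the Arab size is a placeholder value.
link5 <- data.frame(
  ccode        = c(600, 600),
  languageName = c("Arabic", "Arabic"),
  group        = c("Arabs", "Sahrawis"),
  size         = c(0.70, 0.016),
  status       = c("DOMINANT", "DISCRIMINATED"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a small excluded group when the same language
# also maps to a group that is included in power in the same country.
drop_small_excluded <- function(link, size_cutoff = 0.1) {
  included <- link$status %in%
    c("MONOPOLY", "DOMINANT", "SENIOR PARTNER", "JUNIOR PARTNER")
  key <- paste(link$ccode, link$languageName, sep = "|")
  # TRUE for every row whose country-language also has an included group.
  has_included <- as.logical(ave(included, key, FUN = any))
  drop <- !included & link$size < size_cutoff & has_included
  link[!drop, , drop = FALSE]
}

link5_fixed <- drop_small_excluded(link5)
print(link5_fixed)
```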

Vertical Merge

At this point in time, I have the Afrobarometer Wave 4-6 surveys merged with EPR. For roughly 70% of all respondents, I have their corresponding Ethnic Group Status information (which is not included in Afrobarometer).

myvars <- c("COUNTRY", "Statename", "ccode", "RESPNO", "age",
            "edu", "primary", "secondary", "tertiary", "language", "year", "identity", "rural", "female", "employment", "employed", "democracy", "democracyInCountry", "satisfiedDemInCountry", "trustPresident", "trustParliament", "trustRP", "treatedUnfairly", "languageName", "group", "size", "status")

ab4 <- ab4[myvars]
ab5 <- ab5[myvars]
ab6 <- ab6[myvars]

save(ab4, file = "ab4_final.Rda")
save(ab5, file = "ab5_final.Rda")
save(ab6, file = "ab6_final.Rda")

# Stack all three waves so that aball contains Waves 4-6.
aball <- bind_rows(ab4, ab5, ab6)
aball.new <- aball

save(aball.new, file = "ab_all_final_new.Rda")
write.csv(aball.new, file = "ab_all_final_new.csv")


datatable(aball[1:20,])

To note, the “ab_all_final.Rda” dataset is the individual respondent information from Afrobarometer that can be used to tackle Question 1.

Please see the “clott_egd_q1_rmark.Rmd” file for a preliminary data analysis for Question 1.

Backtracking Age Profiles

Let’s try to create an aggregate now. So “aball” contains three waves of Afrobarometer:

Afrobarometer Wave 4 - Conducted in 2008
Afrobarometer Wave 5 - Conducted from 2011-2013
Afrobarometer Wave 6 - Conducted from 2014-2015

I have a survey in which respondents were surveyed at different times, therefore their ages are not standardized, which means I need to fix this problem before backtracking age-cohort profiles. To solve this, I increase everyone’s age by the number of years between their survey and 2015, the last year we have information. For instance, a 35-year-old who was surveyed in 2008 is treated as 42 in 2015. Of course this creates issues if someone was already elderly in 2008 (say ages 85+); however, we can keep them in the survey since we are assuming all education would be attained at a younger age. Therefore, keeping their observations helps bolster our estimates of earlier years.

From this information, my preliminary attempt at getting educational attainment rates per country group is to do the following:

Create a for loop in which each iteration aggregates the average educational attainment per ethnic group per year, while limiting the respondent sample by age as I backtrack through time.
Each pass through the loop removes respondents based on their age, with each pass corresponding to one year.
For instance, the first loop takes the average educational attainment of each ethnic group - using the educational attainment of all respondents (ages 18+). This gives us the educational attainment for 2015.
The second loop takes the average educational attainment of each ethnic group - using the educational attainment of all respondents ages 19+. This gives us the educational attainment for 2014.
The third loop takes the average educational attainment of each ethnic group - using the educational attainment of all respondents 20+. This gives us the educational attainment for 2013.
Repeat until I have information through 1969.

This method of backtracking group-year information is based upon the assumption that education (at least primary and secondary) will be completed in the first 18 years of respondents’ life, on average. Tertiary education will still be captured by individuals older than 18.
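Before running this on the full survey, the mapping from age cutoff to year can be checked on a tiny example. The sketch below uses made-up respondents and a hypothetical helper, `backtrack()`. It returns results in long format rather than the wide table built by the loop, which is a design choice of this illustration and not the original method.

```r
# Made-up respondents: ageUpdate is age standardized to 2015, edu is attainment.
toy <- data.frame(
  group     = c("A", "A", "A", "B", "B", "B"),
  ageUpdate = c(18, 30, 60, 19, 40, 65),
  edu       = c(4, 3, 1, 5, 2, 0)
)

# Hypothetical helper: mean education of respondents aged min_age or older,
# labelled with the year that cutoff stands for (18+ -> 2015, 19+ -> 2014, ...).
backtrack <- function(d, min_age, base_year = 2015, base_age = 18) {
  keep <- d[d$ageUpdate >= min_age, ]
  out <- aggregate(edu ~ group, data = keep, FUN = mean)
  out$year <- base_year - (min_age - base_age)
  out
}

# Three cutoffs only, to keep the output short.
est <- do.call(rbind, lapply(c(18, 19, 31), backtrack, d = toy))
print(est)
```

In this toy data, group A's estimate falls as the cutoff rises because only its older, less educated respondents remain in the sample.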

aball$timeDiff <- 2015 - aball$year
aball$ageUpdate <- aball$age + aball$timeDiff

# Create a table of the average education levels as they stand in 2015.
egeAll <- aggregate(aball$edu, by=list(Statename = aball$Statename, group = aball$group),
                      FUN=mean, na.rm = T)
egeAll <- as.data.frame(egeAll)
egeAll <- egeAll[with(egeAll, order(Statename, group)),]
colnames(egeAll)[3] <- "2015.1"



# I want to first include information on their EPR status before I put in all the education information.

myvars <- c("Statename", "group", "status")
abGroupStatus <- aball[myvars]
abGroupStatus <- unique(abGroupStatus)

#egeAll <- join(egeAll, abGroupStatus, by = c("Statename", "group"))
#egeAll$included <- ifelse(egeAll$status == "MONOPOLY" | egeAll$status == "DOMINANT" | egeAll$status == "SENIOR PARTNER" | egeAll$status == "JUNIOR PARTNER", 1, 0)

year <- 2015

# Here's my loop. Where i reflects ages 18-65.
for (i in 18:65) {
    abTemp <- aball[ which(aball$ageUpdate >= i),]
    table.temp <- aggregate(abTemp$edu, by=list(Statename = abTemp$Statename, group = abTemp$group),
                            FUN=mean, na.rm = T)
    table.temp <- as.data.frame(table.temp)
    table.temp <- table.temp[with(table.temp, order(Statename, group)),]
    colnames(table.temp)[3] <- year
    
    year <- year-1
    egeAll <- merge(egeAll, table.temp, by = c("Statename", "group"), all = T)
    
}

egeAll$`2015.1` <- NULL
datatable(egeAll[1:20,])

save(egeAll, file = "ege_all.Rda")

And voila! We have a dataset that has ethnic group education level per country year. Let’s melt it and add in the EPR information again. The trick is that the EPR information (whether or not a group was discriminated, powerless, in power, etc.) changes historically. So we actually want to merge this new dataset with the old EPR dataset. Therefore, we can say that “X Group was Discriminated in 1975, and their education level was Y.”

Then we can look at a couple figures of what we have.
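The idea of attaching a year-specific status to each education estimate can be illustrated on toy data with base R. In this sketch, “Xland” and “G1” are hypothetical names and the values are made up. It also assumes the melted `year` column arrives as a factor (which is typically how `melt()` returns its variable column), so it is converted to an integer before joining.

```r
# Toy EPR-style table: one row per group and status period (from/to years).
epr_toy <- data.frame(
  statename = c("Xland", "Xland"),
  group     = c("G1", "G1"),
  from      = c(1970, 1976),
  to        = c(1975, 1980),
  status    = c("DISCRIMINATED", "JUNIOR PARTNER"),
  stringsAsFactors = FALSE
)

# Expand each period into one row per group-year.
n_years  <- epr_toy$to - epr_toy$from + 1
epr_long <- epr_toy[rep(seq_len(nrow(epr_toy)), n_years),
                    c("statename", "group", "status")]
epr_long$year <- unlist(Map(seq, epr_toy$from, epr_toy$to))
rownames(epr_long) <- NULL

# Toy melted education table; year is a factor, as after melting a wide table.
edu_long <- data.frame(statename = "Xland", group = "G1",
                       year = factor(c("1975", "1976")), Edu = c(1.2, 1.3))
# Convert factor labels to integers so the join keys have matching types.
edu_long$year <- as.integer(as.character(edu_long$year))

merged <- merge(edu_long, epr_long,
                by = c("statename", "group", "year"), all.x = TRUE)
print(merged)
```

Here the 1975 estimate picks up the “DISCRIMINATED” status and the 1976 estimate picks up “JUNIOR PARTNER”, which is the kind of group-year statement described above.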

egeMelt <- melt(egeAll, id.vars = c("Statename", "group"))
colnames(egeMelt)[1] <- "statename"
colnames(egeMelt)[3] <- "year"
colnames(egeMelt)[4] <- "Edu"

head(egeMelt)

##   statename                                group year      Edu
## 1   Algeria                                Arabs 2015 3.195989
## 2   Algeria                              Berbers 2015 2.436170
## 3     Benin                  South/Central (Fon) 2015 2.376828
## 4     Benin Southeastern (Yoruba/Nagot and Goun) 2015 2.083770
## 5     Benin                  Southwestern (Adja) 2015 1.954301
## 6  Botswana                                Birwa 2015 3.659574

EPR <- read.csv("EPR-2014.csv")
EPR$year <- mapply(seq, EPR$from, EPR$to, SIMPLIFY=FALSE)
EPR <- EPR %>% 
    unnest(year) %>% 
    select(-from,-to)


EPR$ccode <- countrycode(EPR$statename, "country.name", "cown")

## Subset just countries in aball
#table(aball$ccode)
aballccode <- as.data.frame(table(aball$ccode))
aballccode <- as.vector(aballccode$Var1)
#head(aballccode) #list of country codes (COW) in aball

EPR <- subset(EPR, EPR$ccode %in% aballccode)
#table(EPR$ccode)

myvars <- c("statename", "group", "year", "status")
EPR <- EPR[myvars]

egeMelt <- join(egeMelt, EPR, by = c("statename", "group", "year"))

egeMelt$included <- ifelse(egeMelt$status == "MONOPOLY" | egeMelt$status == "DOMINANT" | egeMelt$status == "SENIOR PARTNER" | egeMelt$status == "JUNIOR PARTNER", 1, 0)

countryList <- egeMelt$statename
countryList <- unique(countryList)


plots <- list()
for (i in seq_along(countryList)) {
  #Can do any country below.
  figure<-ggplot(na.omit(egeMelt[ which(egeMelt$statename==countryList[i]),]), aes(x = factor(year, levels = rev(levels(factor(year)))), y = Edu, group = as.factor(group), colour = as.factor(included))) +
        geom_point() +
        geom_line() +
        theme(legend.position = "bottom") +
        xlab("Year") +
        ggtitle(countryList[i]) +
      scale_x_discrete(breaks=seq(1970,2015, 5)) + labs(color='Included in Power') 
  
  plots[[i]] <- figure
   #ggsave(figure, file=paste0("plot_", countryList[i],".png"))
  #print(figure)
}

save(egeMelt, file = "ege_melt.Rda")
write.csv(egeMelt, file = "ege_melt.csv")
datatable(egeMelt[1:20,])

Example Figures & Analyses

The previous section (“Code & Creation of EGE”) leaves us with two novel datasets which we can explore further.

A regional Ethnic Group Education dataset (EGE) for Africa. This dataset includes all ethnic group educational attainment rates from 1969-2015.
An Afrobarometer Survey (Waves 4-6, 2008-2016) that has been merged with Ethnic Power Relations data. In other words, we can connect individuals to their ethnic groups along with information on said groups that was not originally included in Afrobarometer.

All of the following figures/visualizations are made by myself using Tableau and R (ggplot2). Original sources of information are included in each figure.

Ethnic Group Education (EGE) Dataset

The preliminary EGE dataset contains ethnic group educational attainment rates from 1969-2015 in the following countries:

##  [1] "Algeria"       "Benin"         "Botswana"      "Burkina Faso" 
##  [5] "Burundi"       "Cameroon"      "Cote d’Ivoire" "Egypt"        
##  [9] "Ghana"         "Guinea"        "Kenya"         "Lesotho"      
## [13] "Liberia"       "Madagascar"    "Malawi"        "Mali"         
## [17] "Mauritius"     "Morocco"       "Mozambique"    "Namibia"      
## [21] "Niger"         "Nigeria"       "Senegal"       "Sierra Leone" 
## [25] "South Africa"  "Swaziland"     "Tanzania"      "Tunisia"      
## [29] "Uganda"        "Zambia"        "Zimbabwe"

For each year, it also includes the Ethnic Power Relations (EPR) ethnic group status information (whether or not the group was a monopoly, dominant, senior partner, junior partner, powerless, discriminated, or irrelevant). I’ve coded that to be “in power” or “not in power”, as defined by EPR. Groups that are “monopoly, dominant, senior partner, or junior partner” are considered to be included in politics, whereas groups that are discriminated or powerless are considered to be excluded from politics.
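One way to make this coding explicit is a small recoding function. The sketch below is illustrative, and `code_inclusion()` is a hypothetical helper. It assumes that any status outside the two lists (for example “irrelevant”, or a missing status) should be left as NA instead of being counted as excluded, which is a stricter choice than a simple 1/0 split.

```r
# Hypothetical helper: 1 = included in politics, 0 = excluded, NA = neither list.
code_inclusion <- function(status) {
  s <- toupper(trimws(status))
  out <- rep(NA_integer_, length(s))
  out[s %in% c("MONOPOLY", "DOMINANT", "SENIOR PARTNER", "JUNIOR PARTNER")] <- 1L
  out[s %in% c("POWERLESS", "DISCRIMINATED")] <- 0L
  out
}

recoded <- code_inclusion(c("Dominant", "POWERLESS", "IRRELEVANT", NA))
print(recoded)
```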

Here is a look at the first 100 observations of the dataset.

With this dataset, we can create some figures to explore cross-national trends on the relationship between educational attainment and political exclusion across ethnic lines.

We do observe a noticeable difference in average educational attainment, across all 36 countries, between politically included and excluded groups. While the trends seem to suggest a convergence, as one would theorize given major educational initiatives towards the end of the 20th century, there remains a lag between politically included and excluded groups.

Educational Attainment

Alternatively, we can look at specific countries of interest and see if certain ethnic groups historically have higher education rates. Below are examples from Namibia and Morocco; however, we can create identically formatted figures for all 36 countries.

Educational Attainment in Namibia

Educational Attainment in Morocco

This last figure is particularly insightful. The figure highlights the average educational attainment (in a box-and-whisker format with confidence bands) of both women and men, dependent upon their identification with a politically included or excluded group. There are 8 possible group averages.

For six of these groups, members on average complete primary school (the red line). Only women who identify as members of ethnic groups excluded from politics do not finish primary school, on average.

Average Education

Education, Ethnic Group Identity, and Political Attitudes

In constructing the EGE dataset, we also necessarily created an Afrobarometer Survey (Waves 4-6, 2008-2016) that has been merged with Ethnic Power Relations data. In other words, we can connect individuals to their ethnic groups along with information on said groups that was not originally included in Afrobarometer.

My research indicates that education uniquely impacts marginalized groups as opposed to advantaged groups. I hypothesize that education will foster comparatively critical views of the government among marginalized individuals. Alternatively, education will foster state-support among advantaged groups to protect the status-quo.

The following figure investigates the extent to which respondents trust the ruling party in their country as a reflection of their average education levels, broken down by political inclusion status. Among politically excluded groups (blue bars), increased education is associated with decreased levels of trust in the ruling party. However, there does seem to be an overall trend in this direction as well; therefore, it is unclear whether this would produce a statistically significant relationship.

Trust in Ruling Party

Using the EGE and Afrobarometer/EPR merged dataset, I have done additional research on how political exclusion interacts with educational propaganda to impact citizens’ national identities and preferred politics. To note, not all education systems are similarly inclusive. Below is a visualization constructed using “Varieties of Democracy” data on the extent to which governments “respect academic freedom and cultural expression.” Higher scores (max: 4, darker colors) indicate greater respect for academic freedom and diverse cultural expression in educational settings.

Worldwide Academic Freedom

I’ve even merged this information with the EGE and Afrobarometer datasets. Using the individual level survey data from Afrobarometer I can predict when individuals are more likely to identify with the national or ethnic group. In short, I argue that education will lead to marginalized individuals being more likely to identify with their ethnic group, as opposed to their nation, when they are presented with culturally exclusive and propaganda-based education. My research ends up providing support for the hypothesis.

For the full paper please email me (clott.1 [at] osu.edu). I provide the main figure/table below. The unit of analysis is the individual, and the dataset contains over 80,000 respondents from Waves 4-6 of Afrobarometer. The key takeaway is that for individuals who belong to politically excluded ethnic groups, educational propaganda (mono-lingual exclusive policies, cultural erasure, etc.) backfires and leads these individuals to be more likely to identify with their ethnic group as opposed to the nation.

Predicted Probability of Respondent Identity

Regression Output

Looking Ahead

There are a handful of obstacles yet to overcome as this project moves towards a global dataset construction.

Across all respondents in the Afrobarometer surveys 4-6, only 70% are currently matched with EPR. This is likely due to my having to drop instances where 1 language matches 2+ incompatible ethnic groups. However, most of this overlap is primarily limited to one country.
This is only information on Africa (and 36 countries therein). Global construction will necessitate use and merging of multiple regional survey efforts (the barometers) or with global datasets (World Value Survey, Demographic and Health Surveys).
I don’t currently have confidence intervals for each estimate. Ideally the dataset will provide upper- and lower-bound bands for each education estimate, dependent upon the number of respondents that are aggregated in that estimation. Alternatively, I am also considering multi-level regression with post stratification where I can combine information from multiple datasets to create stronger estimates.
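As a starting point for the confidence-interval obstacle, a simple band around each group mean could be computed from the number of respondents behind it. The sketch below is a rough illustration on made-up data; `group_ci()` is a hypothetical helper. It uses an unweighted t-based interval, which ignores survey weights and design effects, so it is an assumption of this example and not a recommended final estimator.

```r
# Hypothetical helper: per-group mean with a t-based confidence band.
group_ci <- function(edu, group, level = 0.95) {
  rows <- lapply(split(edu, group), function(x) {
    x <- x[!is.na(x)]
    n <- length(x)
    m <- mean(x)
    # Half-width shrinks as n grows; undefined (NA) with a single respondent.
    half <- if (n > 1) qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n) else NA_real_
    data.frame(n = n, mean = m, lower = m - half, upper = m + half)
  })
  out <- cbind(group = names(rows), do.call(rbind, rows))
  rownames(out) <- NULL
  out
}

# Made-up respondents: group A has 4, group B has 2, group C has 1.
toy <- data.frame(group = c(rep("A", 4), rep("B", 2), "C"),
                  edu   = c(2, 3, 3, 4, 1, 5, 2))
ci <- group_ci(toy$edu, toy$group)
print(ci)
```

Groups with few respondents get wide or undefined bands, which is one way to flag group-year estimates that rest on very little data.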

Sources & More Information

Global Ethnic Group Education Dataset

Alec Clott

1/10/2021