I am broadly interested in poverty alleviation and thus selected a UNICEF data set that contains adult literacy data by country. Literacy rate is defined as the percentage of the population aged 15 years and over who can both read and write, with understanding, a short simple statement on their everyday life. In general, highly developed countries have no rate values in this data set because they are assumed to have completely literate populations; these countries’ rates are indicated by “-”.
Data source: Data and Analytics Section; Division of Data, Research and Policy, UNICEF. (2015). “Youth and adult literacy rates”. Accessed from https://data.unicef.org/wp-content/uploads/2016/05/education_table-youth-and-adult-literacy-rate-updated-oct.-2015.xlsx
I start by reading in the data and checking its structure.
library(tidyr)
library(dplyr)
library(ggplot2)
literacy <- read.csv("https://raw.githubusercontent.com/chrosemo/data607_fall19_project2/master/literacy.csv", skip=9)
head(literacy)
## ISO.Code Countries.and.areas Reference.year.s. Total X Sex X.1 X.2
## 1 NA Male NA Female
## 2 AFG Afghanistan 2011 32 NA 45 NA 18
## 3 ALB Albania 2011 97 NA 98 NA 96
## 4 DZA Algeria 2006 73 NA 81 NA 64
## 5 AND Andorra - NA - NA -
## 6 AGO Angola 2013 71 NA 82 NA 60
## X.3 Source
## 1 NA
## 2 NA UNESCO Institute for Statistics
## 3 NA UNESCO Institute for Statistics
## 4 NA UNESCO Institute for Statistics
## 5 NA
## 6 NA UNESCO Institute for Statistics
str(literacy)
## 'data.frame': 215 obs. of 10 variables:
## $ ISO.Code : Factor w/ 198 levels "","AFG","AGO",..: 1 2 4 52 5 3 9 7 8 10 ...
## $ Countries.and.areas: Factor w/ 212 levels ""," Eastern and Southern Africa",..: 1 4 5 6 7 8 9 10 11 12 ...
## $ Reference.year.s. : Factor w/ 15 levels ""," - Data not available.",..: 1 11 11 6 1 13 13 13 11 1 ...
## $ Total : Factor w/ 57 levels "","-","100","15",..: 1 9 55 33 2 31 57 56 3 2 ...
## $ X : logi NA NA NA NA NA NA ...
## $ Sex : Factor w/ 54 levels "","-","100","23",..: 54 9 52 35 2 36 52 52 3 2 ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : Factor w/ 62 levels "","-","100","12",..: 62 5 58 33 2 29 61 60 3 2 ...
## $ X.3 : logi NA NA NA NA NA NA ...
## $ Source : Factor w/ 2 levels "","UNESCO Institute for Statistics": 1 2 2 2 1 2 2 2 2 1 ...
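As a hedged alternative to the import above, the sketch below reads a small inline sample with stringsAsFactors = FALSE and na.strings = "-", so that the rate columns arrive as numbers and the "-" placeholders as NA. The three rows are copied from the head() output above, but the shortened column names and the inline text are assumptions made for illustration. Whether the same call works unchanged on the full file is also an assumption, since the sub-header row and the footnote rows may still force some columns to character.

```r
# Inline sample imitating the file layout; column names are simplified.
sample_csv <- "ISO.Code,Countries.and.areas,Total,Male,Female
AFG,Afghanistan,32,45,18
ALB,Albania,97,98,96
AND,Andorra,-,-,-"

# Treat "-" as missing and keep text columns as character rather than factor.
sample_literacy <- read.csv(text = sample_csv,
                            na.strings = "-",
                            stringsAsFactors = FALSE)

str(sample_literacy)

# Rows without a reported rate can then be dropped with is.na().
sample_literacy <- sample_literacy[!is.na(sample_literacy$Total), ]
```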
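I select the country-specific rows, filter out the countries with effectively complete literacy (no data), remove irrelevant columns, and rename the sex-specific columns. Noting that the ‘Male’ and ‘Female’ columns are factors with different numbers of levels, I convert both columns and the ‘Total’ column to character format prior to gathering the data into long format. Going through character format also matters for the next step, because calling as.numeric() directly on a factor returns its internal level codes rather than the displayed values. I then convert the ‘Total’ and gathered ‘Rate’ columns to numeric format and divide by 100 to express them as proportions.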
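literacy <- literacy %>%
  slice(2:198) %>%              # keep only the country-specific rows
  filter(Total != '-') %>%      # drop countries with no reported rate
  select(-c(1,5,7,9,10)) %>%    # drop ISO code, empty spacer columns, and source
  rename('Male' = 'Sex', 'Female' = 'X.2')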
literacy$Total <- as.character(literacy$Total)
literacy$Male <- as.character(literacy$Male)
literacy$Female <- as.character(literacy$Female)
literacy <- gather(literacy, Sex, Rate, -1, -2, -3)
literacy$Total <- as.numeric(literacy$Total)/100
literacy$Rate <- as.numeric(literacy$Rate)/100
head(literacy)
## Countries.and.areas Reference.year.s. Total Sex Rate
## 1 Afghanistan 2011 0.32 Male 0.45
## 2 Albania 2011 0.97 Male 0.98
## 3 Algeria 2006 0.73 Male 0.81
## 4 Angola 2013 0.71 Male 0.82
## 5 Antigua and Barbuda 2013 0.99 Male 0.98
## 6 Argentina 2013 0.98 Male 0.98
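The reshaping step can also be sketched with pivot_longer(), the newer tidyr counterpart of gather() (available from tidyr 1.0.0). The example is a minimal illustration on three rows taken from the head() output above; the data frame name and the shortened ‘Country’ column are assumptions, not part of the original analysis.

```r
library(tidyr)

# Three rows from the head() output above; column names are shortened.
wide <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria"),
  Total   = c("32", "97", "73"),
  Male    = c("45", "98", "81"),
  Female  = c("18", "96", "64"),
  stringsAsFactors = FALSE
)

# Stack the two sex-specific columns into a Sex/Rate pair of columns.
long <- pivot_longer(wide, cols = c(Male, Female),
                     names_to = "Sex", values_to = "Rate")

# Convert the character percentages to proportions.
long$Total <- as.numeric(long$Total) / 100
long$Rate  <- as.numeric(long$Rate) / 100

long
```

One difference worth noting: pivot_longer() keeps each country's two rows together, whereas gather() stacks all ‘Male’ rows above all ‘Female’ rows, so the row order differs even though the content is the same.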
Since no specific analysis is prescribed for this data set, I conduct an exploratory data analysis. I start by checking the distribution of total adult literacy rates by country. The rates range from 0.150 to 1.000, with an IQR of 0.270, a median of 0.900, and a mean of 0.813. The distribution is heavily left-skewed, which is consistent with the mean falling below the median.
summary(literacy$Total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1500 0.7100 0.9000 0.8127 0.9800 1.0000
ggplot(data=literacy, mapping=aes(x=Total)) +
geom_histogram(color="black", fill="white", bins=50) +
geom_vline(data=literacy, aes(xintercept=mean(Total)), linetype="dashed")
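In the long format each country's ‘Total’ value appears twice, once per sex. That leaves the mean and median unchanged but doubles the histogram counts, and interpolated quartiles can shift slightly. A minimal sketch of reducing the data to one row per country first is shown below; the toy data frame (four countries from the output above, with a shortened ‘Country’ column) is an assumption for illustration.

```r
library(dplyr)

# Toy long-format data: each country appears once per sex.
long <- data.frame(
  Country = rep(c("Afghanistan", "Albania", "Algeria", "Angola"), times = 2),
  Total   = rep(c(0.32, 0.97, 0.73, 0.71), times = 2),
  Sex     = rep(c("Male", "Female"), each = 4),
  Rate    = c(0.45, 0.98, 0.81, 0.82, 0.18, 0.96, 0.64, 0.60),
  stringsAsFactors = FALSE
)

# Keep one row per country before summarising the country-level total.
totals <- distinct(long, Country, Total)

nrow(long)    # rows in the long table
nrow(totals)  # rows after de-duplication

summary(totals$Total)
IQR(totals$Total)  # width of the middle half of the distribution
```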
The sex-specific distributions of literacy rates are similar in shape (heavily left-skewed) to the distribution of total rates. The median (0.925) and mean (0.852) for males exceed the corresponding values for females (0.895 and 0.775, respectively). By contrast, the range (0.910) and IQR (0.340) for females are both wider than their male-specific counterparts (0.770 and 0.200, respectively).
tapply(literacy$Rate, literacy$Sex, summary)
## $Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0900 0.6300 0.8950 0.7748 0.9700 1.0000
##
## $Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.7800 0.9250 0.8515 0.9800 1.0000
ggplot(data = literacy, mapping=aes(x=Sex, y=Rate)) +
geom_boxplot(outlier.color="red") +
coord_flip()
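The range and IQR quoted above are not printed by summary() directly; they follow from the minimum, maximum, and quartiles. One way to compute them per sex in a single step is sketched here on a toy data frame (an assumption for illustration, using a few rates from the output above), with hypothetical column names range_width and iqr.

```r
library(dplyr)

# Toy long-format data; the real analysis would use the full literacy table.
long <- data.frame(
  Country = rep(c("Afghanistan", "Albania", "Algeria", "Angola"), times = 2),
  Sex     = rep(c("Male", "Female"), each = 4),
  Rate    = c(0.45, 0.98, 0.81, 0.82, 0.18, 0.96, 0.64, 0.60),
  stringsAsFactors = FALSE
)

# Range width and interquartile range of the rate within each sex.
spread_by_sex <- long %>%
  group_by(Sex) %>%
  summarise(
    range_width = max(Rate) - min(Rate),
    iqr         = IQR(Rate)
  )

spread_by_sex
```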
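Next, I look at the countries with the highest and lowest literacy rates, overall and by sex. Eighteen countries have a total literacy rate of 100%; 16 countries reach 100% for females and 21 for males. Because 18 countries tie at the maximum total rate and each appears once per sex, top_n(10, Total) returns 36 rows rather than 10.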
literacy %>% top_n(10, Total)
## Countries.and.areas Reference.year.s. Total Sex
## 1 Armenia 2011 1 Male
## 2 Azerbaijan 2010 1 Male
## 3 Belarus 2009 1 Male
## 4 Cuba 2012 1 Male
## 5 Democratic People's Republic of Korea 2008 1 Male
## 6 Estonia 2011 1 Male
## 7 Georgia 2013 1 Male
## 8 Kazakhstan 2009 1 Male
## 9 Latvia 2011 1 Male
## 10 Lithuania 2011 1 Male
## 11 Palau 2013 1 Male
## 12 Poland 2013 1 Male
## 13 Russian Federation 2010 1 Male
## 14 Slovenia 2013 1 Male
## 15 Tajikistan 2013 1 Male
## 16 Turkmenistan 2013 1 Male
## 17 Ukraine 2013 1 Male
## 18 Uzbekistan 2013 1 Male
## 19 Armenia 2011 1 Female
## 20 Azerbaijan 2010 1 Female
## 21 Belarus 2009 1 Female
## 22 Cuba 2012 1 Female
## 23 Democratic People's Republic of Korea 2008 1 Female
## 24 Estonia 2011 1 Female
## 25 Georgia 2013 1 Female
## 26 Kazakhstan 2009 1 Female
## 27 Latvia 2011 1 Female
## 28 Lithuania 2011 1 Female
## 29 Palau 2013 1 Female
## 30 Poland 2013 1 Female
## 31 Russian Federation 2010 1 Female
## 32 Slovenia 2013 1 Female
## 33 Tajikistan 2013 1 Female
## 34 Turkmenistan 2013 1 Female
## 35 Ukraine 2013 1 Female
## 36 Uzbekistan 2013 1 Female
## Rate
## 1 1.00
## 2 1.00
## 3 1.00
## 4 1.00
## 5 1.00
## 6 1.00
## 7 1.00
## 8 1.00
## 9 1.00
## 10 1.00
## 11 0.99
## 12 1.00
## 13 1.00
## 14 1.00
## 15 1.00
## 16 1.00
## 17 1.00
## 18 1.00
## 19 1.00
## 20 1.00
## 21 0.99
## 22 1.00
## 23 1.00
## 24 1.00
## 25 1.00
## 26 1.00
## 27 1.00
## 28 1.00
## 29 1.00
## 30 1.00
## 31 1.00
## 32 1.00
## 33 1.00
## 34 1.00
## 35 1.00
## 36 0.99
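The counts of countries at 100% are not shown in the output itself. A minimal sketch of how such counts could be derived follows, assuming a toy data frame built from four countries in the output above; the names n_total, n_by_sex, and n_full are hypothetical. near() is used instead of == to avoid relying on exact floating-point equality.

```r
library(dplyr)

# Toy long-format data; rates for these countries are taken from the output above.
long <- data.frame(
  Country = rep(c("Armenia", "Belarus", "Palau", "Albania"), times = 2),
  Total   = rep(c(1, 1, 1, 0.97), times = 2),
  Sex     = rep(c("Male", "Female"), each = 4),
  Rate    = c(1, 1, 0.99, 0.98, 1, 0.99, 1, 0.96),
  stringsAsFactors = FALSE
)

# Countries whose total rate is 100%, counted once per country.
n_total <- long %>%
  distinct(Country, Total) %>%
  filter(near(Total, 1)) %>%
  nrow()

# Countries at 100% within each sex.
n_by_sex <- long %>%
  group_by(Sex) %>%
  summarise(n_full = sum(near(Rate, 1)))

n_total
n_by_sex
```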
literacy %>% top_n(-10, Total)
## Countries.and.areas Reference.year.s. Total Sex Rate
## 1 Benin 2006 0.29 Male 0.41
## 2 Burkina Faso 2007 0.29 Male 0.37
## 3 Guinea 2010 0.25 Male 0.37
## 4 Niger 2012 0.15 Male 0.23
## 5 South Sudan 2008 0.27 Male 0.35
## 6 Benin 2006 0.29 Female 0.18
## 7 Burkina Faso 2007 0.29 Female 0.22
## 8 Guinea 2010 0.25 Female 0.12
## 9 Niger 2012 0.15 Female 0.09
## 10 South Sudan 2008 0.27 Female 0.19
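The lowest total rates belong to African countries and fall below 0.30. Because the long-format data hold two rows per country, top_n(-10, Total) returns ten rows covering only five countries (Benin, Burkina Faso, Guinea, Niger, and South Sudan), whose male rates range from 0.23 to 0.41. The ten lowest female rates are all below 0.30 and, apart from Afghanistan, also belong to African countries.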
literacy %>% arrange(Sex, Rate) %>% group_by(Sex) %>% top_n(-10, Rate)
## # A tibble: 21 x 5
## # Groups: Sex [2]
## Countries.and.areas Reference.year.s. Total Sex Rate
## <fct> <fct> <dbl> <chr> <dbl>
## 1 Niger 2012 0.15 Female 0.09
## 2 Guinea 2010 0.25 Female 0.12
## 3 Afghanistan 2011 0.32 Female 0.18
## 4 Benin 2006 0.290 Female 0.18
## 5 South Sudan 2008 0.27 Female 0.19
## 6 Mali 2010 0.31 Female 0.2
## 7 Burkina Faso 2007 0.290 Female 0.22
## 8 Central African Republic 2010 0.37 Female 0.24
## 9 Liberia 2007 0.43 Female 0.27
## 10 Chad 2013 0.38 Female 0.290
## # ... with 11 more rows
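The grouped result above has 21 rows rather than 20 because top_n() keeps ties at the cutoff. The same behaviour can be sketched with slice_min(), which supersedes top_n() from dplyr 1.0.0 onward and makes the tie handling explicit. The toy data frame below uses five countries from the output above and asks for the three lowest rates per sex; the data frame and result names are assumptions for illustration.

```r
library(dplyr)

# Toy long-format data with a tie at 0.18 among the female rates.
long <- data.frame(
  Country = rep(c("Niger", "Guinea", "Afghanistan", "Benin", "South Sudan"), times = 2),
  Sex     = rep(c("Female", "Male"), each = 5),
  Rate    = c(0.09, 0.12, 0.18, 0.18, 0.19, 0.23, 0.37, 0.45, 0.41, 0.35),
  stringsAsFactors = FALSE
)

# Three lowest rates per sex, keeping every row tied at the cutoff.
lowest <- long %>%
  group_by(Sex) %>%
  slice_min(Rate, n = 3, with_ties = TRUE) %>%
  ungroup()

# Same request, but truncated to exactly three rows per sex.
lowest_no_ties <- long %>%
  group_by(Sex) %>%
  slice_min(Rate, n = 3, with_ties = FALSE) %>%
  ungroup()

nrow(lowest)
nrow(lowest_no_ties)
```

With ties kept, the female group returns four rows because Afghanistan and Benin share the third-lowest rate.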
I finish my analysis by testing the difference in literacy rate by sex. The within-country differences (male minus female) skew right and do not appear normally distributed. After an admittedly quick check of whether the data meet the test assumptions, I apply a two-tailed paired Wilcoxon signed rank test, which suits both the non-normal differences and the within-country pairing of rates. The pairing relies on the ‘Male’ and ‘Female’ rows appearing in the same country order, which holds here because gather() stacks the two columns in their original row order. The test returns a significant p-value (p < 2.2e-16), meaning that, if the test assumptions are met, male and female literacy rates differ; the summary statistics above indicate that the male rates are the higher ones.
male_literacy <- literacy$Rate[literacy$Sex == "Male"]
female_literacy <- literacy$Rate[literacy$Sex == "Female"]
diff_literacy <- male_literacy - female_literacy
hist(diff_literacy)
qqnorm(diff_literacy)
wilcox.test(male_literacy, female_literacy, alternative="two.sided", paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: male_literacy and female_literacy
## V = 6861.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
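As a hedged variation on the test above, the sketch below pairs the rates explicitly by country with pivot_wider() instead of relying on row order, and uses a one-sided alternative to test directly whether male rates exceed female rates. The six-country data frame is a toy subset taken from the output above, and the one-sided alternative is an assumption about the hypothesis of interest, ideally fixed before looking at the data; it is not the test reported in this analysis.

```r
library(tidyr)

# Toy long-format data: six countries, one row per country and sex.
long <- data.frame(
  Country = rep(c("Afghanistan", "Albania", "Algeria",
                  "Angola", "Benin", "Burkina Faso"), times = 2),
  Sex     = rep(c("Male", "Female"), each = 6),
  Rate    = c(0.45, 0.98, 0.81, 0.82, 0.41, 0.37,
              0.18, 0.96, 0.64, 0.60, 0.18, 0.22),
  stringsAsFactors = FALSE
)

# One row per country, so each male rate sits next to its female counterpart.
wide <- pivot_wider(long, names_from = Sex, values_from = Rate)

# Paired signed rank test of whether male rates tend to exceed female rates.
result <- wilcox.test(wide$Male, wide$Female,
                      paired = TRUE, alternative = "greater")

result$statistic
result$p.value
```

Pairing through a wide table guards against silent misalignment if the long data are ever sorted or filtered differently for each sex.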