This is my attempt to get started (not fully reproduce) the data acquisition, cleaning, and analysis of the data in the “midlife crisis” project presented during the TIER workshop on Nov. 2-3, 2018.
The original World Development Indicators dataset is in an Excel file. I use readxl to read this data in, indicating that the data is in the first sheet.
library(readxl)
wdi <- read_xlsx("Original-Data/original-wdi.xlsx",1)Let’s take a quick peek at wdi. We could just do this using wdi, but to get a “pretty print” version, we’re going to use kable, which comes from the knitr package.
library(knitr)
kable(wdi)| Country Name | Country Code | Series Name | Series Code | 2002 [YR2002] |
|---|---|---|---|---|
| India | IND | GDP per capita (current US$) | NY.GDP.PCAP.CD | 486.640454 |
| India | IND | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 11.890161 |
| Indonesia | IDN | GDP per capita (current US$) | NY.GDP.PCAP.CD | 909.887330 |
| Indonesia | IDN | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 7.257458 |
| Russian Federation | RUS | GDP per capita (current US$) | NY.GDP.PCAP.CD | 2375.157871 |
| Russian Federation | RUS | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 17.953634 |
| Pakistan | PAK | GDP per capita (current US$) | NY.GDP.PCAP.CD | 483.031872 |
| Pakistan | PAK | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 8.723921 |
| China | CHN | GDP per capita (current US$) | NY.GDP.PCAP.CD | 1135.447950 |
| China | CHN | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 15.590028 |
| Jordan | JOR | GDP per capita (current US$) | NY.GDP.PCAP.CD | 1901.580379 |
| Jordan | JOR | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 22.695573 |
| United States | USA | GDP per capita (current US$) | NY.GDP.PCAP.CD | 38175.376383 |
| United States | USA | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 15.039799 |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| Data from database: WDI | NA | NA | NA | NA |
| Last Updated: 05/06/2014 | NA | NA | NA | NA |
Whoops, we brought in some extraneous data we don’t care about (rows 15-19). Let’s remove those and look again:
wdi <- wdi[-c(15:19),]
kable(wdi)| Country Name | Country Code | Series Name | Series Code | 2002 [YR2002] |
|---|---|---|---|---|
| India | IND | GDP per capita (current US$) | NY.GDP.PCAP.CD | 486.640454 |
| India | IND | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 11.890161 |
| Indonesia | IDN | GDP per capita (current US$) | NY.GDP.PCAP.CD | 909.887330 |
| Indonesia | IDN | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 7.257458 |
| Russian Federation | RUS | GDP per capita (current US$) | NY.GDP.PCAP.CD | 2375.157871 |
| Russian Federation | RUS | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 17.953634 |
| Pakistan | PAK | GDP per capita (current US$) | NY.GDP.PCAP.CD | 483.031872 |
| Pakistan | PAK | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 8.723921 |
| China | CHN | GDP per capita (current US$) | NY.GDP.PCAP.CD | 1135.447950 |
| China | CHN | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 15.590028 |
| Jordan | JOR | GDP per capita (current US$) | NY.GDP.PCAP.CD | 1901.580379 |
| Jordan | JOR | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 22.695573 |
| United States | USA | GDP per capita (current US$) | NY.GDP.PCAP.CD | 38175.376383 |
| United States | USA | General government final consumption expenditure (% of GDP) | NE.CON.GOVT.ZS | 15.039799 |
That’s better!
Also, this is interesting – we have “long”, or key-value data (each observation, or row, represents a combination of country and measure – or key –, and there’s one value associated with each row, in the “2002 [YR2002]” column).
But we want it to be “wide” (each observation is a country, and the columns are the various measures, with the value in the intersection).
The best way to handle this is with tidyr, which has some great reshaping commands.
We’ll use the tidyr command spread(). We’ll also, in the same command, remove “Series Name”, since we really only need the “Series Code” field – having both is a bit extraneous. We’re going to use the country code and name later, so we’ll leave both of those fields untouched.
library(tidyr)
library(dplyr)
wide_wdi <- spread(data = wdi %>% select(-`Series Name`),
key = "Series Code",
value = "2002 [YR2002]")Let’s take a peek at wide_wdi:
kable(wide_wdi)| Country Name | Country Code | NE.CON.GOVT.ZS | NY.GDP.PCAP.CD |
|---|---|---|---|
| China | CHN | 15.590028 | 1135.4479 |
| India | IND | 11.890161 | 486.6405 |
| Indonesia | IDN | 7.257458 | 909.8873 |
| Jordan | JOR | 22.695573 | 1901.5804 |
| Pakistan | PAK | 8.723921 | 483.0319 |
| Russian Federation | RUS | 17.953634 | 2375.1579 |
| United States | USA | 15.039799 | 38175.3764 |
Our variables are named in kind of a clunky way. Let’s fix that!
names(wide_wdi)[3:4] <- c("Expenditure", "Income")And peek again!
kable(wide_wdi)| Country Name | Country Code | Expenditure | Income |
|---|---|---|---|
| China | CHN | 15.590028 | 1135.4479 |
| India | IND | 11.890161 | 486.6405 |
| Indonesia | IDN | 7.257458 | 909.8873 |
| Jordan | JOR | 22.695573 | 1901.5804 |
| Pakistan | PAK | 8.723921 | 483.0319 |
| Russian Federation | RUS | 17.953634 | 2375.1579 |
| United States | USA | 15.039799 | 38175.3764 |
The Pew data was saved as an SPSS .sav file format. We can read this via the foreign package.
library(foreign)
pew = read.spss("Original-Data/original-pew.sav", to.data.frame=TRUE)The pew data is quite large! Let’s take a peek at the variables it includes. Since there are quite a few, I’ll just use head(), which shows the first handful of values. The default is 6, but I want a few more, so I specify 20.
head(names(pew), 20)## [1] "country" "psraid" "quest_id" "q1" "q2" "q3"
## [7] "q4" "q5.1rec" "q5.2rec" "q5.3rec" "q6a" "q6b"
## [13] "q6c" "q7" "q8.1rec" "q8.2rec" "q8.3rec" "q9"
## [19] "q10" "q10chi"
In the Pew data, we really only need three variables:
We’ll rename the two question variables to make them more intuitive.
The dplyr function transmute is perfect for creating a new, simplified dataset from an existing dataset.
library(dplyr)
pew <- pew %>%
transmute(country, satis = q2, age = q74)We may still have some data quality issues. Let’s take a peek at the top of pew and also look specifically at the factor levels of satis and age.
head(pew)## country satis age
## 1 Argentina Four 37
## 2 Argentina Three 64
## 3 Argentina Nine 25
## 4 Argentina Eight 40
## 5 Argentina Four 48
## 6 Argentina Eight 86
Ugh! The numbers are written out? This is an artifact of how we brought the data in from SPSS – it gave us labels. It’s handy for country names, but not for things like age.
We’ll have to fix that. Let’s look at the levels to see what’s up:
levels(pew$satis)## [1] "Zero" "One"
## [3] "Two" "Three"
## [5] "Four" "Five"
## [7] "Six" "Seven"
## [9] "Eight" "Nine"
## [11] "Ten" "Don't know (DO NOT READ)"
## [13] "Refused (DO NOT READ)"
Notice that the level number for “Zero” is 1, and the level number for “One” is two, and so on. This will prove useful later.
levels(pew$age)## [1] "18" "19"
## [3] "20" "21"
## [5] "22" "23"
## [7] "24" "25"
## [9] "26" "27"
## [11] "28" "29"
## [13] "30" "31"
## [15] "32" "33"
## [17] "34" "35"
## [19] "36" "37"
## [21] "38" "39"
## [23] "40" "41"
## [25] "42" "43"
## [27] "44" "45"
## [29] "46" "47"
## [31] "48" "49"
## [33] "50" "51"
## [35] "52" "53"
## [37] "54" "55"
## [39] "56" "57"
## [41] "58" "59"
## [43] "60" "61"
## [45] "62" "63"
## [47] "64" "65"
## [49] "66" "67"
## [51] "68" "69"
## [53] "70" "71"
## [55] "72" "73"
## [57] "74" "75"
## [59] "76" "77"
## [61] "78" "79"
## [63] "80" "81"
## [65] "82" "83"
## [67] "84" "85"
## [69] "86" "87"
## [71] "88" "89"
## [73] "90" "91"
## [75] "92" "93"
## [77] "94" "95"
## [79] "97 or older" "Don't know (DO NOT READ)"
## [81] "Refused (DO NOT READ)"
OK, so the first thing we can do is get out the non-responders and imprecise age, and give them an “NA”.
pew$satis[which(pew$satis == "Don't know (DO NOT READ)")] <- NA
pew$satis[which(pew$satis == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "97 or older")] <- NAFor satis, we noticed that the level numbers were one more than the actual labels, so we can easily convert written out labels to numbers:
pew$satis <- as.numeric(pew$satis)-1For age, we have to get the labels themselves, NOT the level numbers. This is confusing! The labels are character, not numeric, so, for example, “18”, not 18.
We have to do a double transformation: first, get the labels themselves by using as.character(), then convert those character data to numeric using as.numeric():
pew$age <- as.numeric(as.character(pew$age))Finally, the country variable in pew is a factor variable, and we want it to be a character string (in order to merge easily). That’s an easy fix:
pew$country <- as.character(pew$country)Let’s peek at the cleaned up Pew data:
head(pew)## country satis age
## 1 Argentina 4 37
## 2 Argentina 3 64
## 3 Argentina 9 25
## 4 Argentina 8 40
## 5 Argentina 4 48
## 6 Argentina 8 86
We’re going to remove the youngest and oldest folks (not sure of the scientific reasoning for this):
pew <- pew %>% filter(age >= 21 & age <= 70 )We want to aggregate country data (squish down all of a given country’s data into a single row). dplyr has a function, summarise (or summarize) which does this nicely.
pew <- pew %>%
group_by(country) %>%
summarise(satis = mean(satis, na.rm = TRUE),
age=mean(age, na.rm = TRUE),
count = n())What does pew look like now?
kable(pew)| country | satis | age | count |
|---|---|---|---|
| Angola | 3.946735 | 31.04034 | 595 |
| Argentina | 5.906205 | 41.83764 | 696 |
| Bangladesh | 5.231250 | 35.63876 | 645 |
| Bolivia | 5.448737 | 36.05465 | 677 |
| Brazil | 6.099282 | 40.03444 | 842 |
| Bulgaria | 3.797066 | 45.48434 | 415 |
| Canada | 6.945498 | 44.69104 | 424 |
| China | 5.174480 | 41.26559 | 2839 |
| Czech Republic | 5.909739 | 44.63420 | 421 |
| Egypt | 6.369645 | 35.20510 | 824 |
| France | 6.513189 | 44.21531 | 418 |
| Germany | 6.377886 | 45.82367 | 828 |
| Ghana | 4.467890 | 36.90305 | 557 |
| Great Britain | 6.401392 | 44.61663 | 433 |
| Guatemala | 7.324582 | 38.09739 | 421 |
| Honduras | 6.839449 | 36.96339 | 437 |
| India | 5.085851 | 37.00260 | 1922 |
| Indonesia | 5.761539 | 37.32747 | 910 |
| Italy | 6.381899 | 42.71082 | 453 |
| Ivory Coast | 5.315790 | 30.75212 | 589 |
| Japan | 5.948854 | 50.98589 | 567 |
| Jordan | 5.343716 | 36.84227 | 932 |
| Kenya | 4.958549 | 37.34364 | 582 |
| Lebanon | 4.924331 | 38.06279 | 860 |
| Mali | 4.461279 | 37.67395 | 595 |
| Mexico | 6.781818 | 39.63464 | 895 |
| Nigeria | 5.869159 | 36.67749 | 862 |
| Pakistan | 5.472683 | 36.19854 | 1642 |
| Peru | 5.682927 | 38.25563 | 622 |
| Philippines | 5.596774 | 39.88818 | 626 |
| Poland | 5.374384 | 43.78846 | 416 |
| Russia | 4.741266 | 45.23326 | 926 |
| Senegal | 5.470016 | 35.81524 | 617 |
| Slovak Republic | 5.392523 | 42.58178 | 428 |
| South Africa | 5.160804 | 37.76047 | 597 |
| South Korea | 6.443251 | 39.94018 | 652 |
| Tanzania | 3.907051 | 39.84494 | 632 |
| Turkey | 4.559382 | 38.31271 | 873 |
| Uganda | 3.987194 | 35.32908 | 863 |
| Ukraine | 4.808126 | 43.73138 | 443 |
| US | 7.019277 | 44.18283 | 1258 |
| Uzbekistan | 5.850921 | 37.50334 | 598 |
| Venezuela | 6.292308 | 38.56899 | 587 |
| Vietnam | 6.375527 | 38.03376 | 711 |
We want to remove countries with a small number of measurements.
pew <- pew %>% filter(count >= 900)Let’s make sure our country names match (so we can merge effectively):
unique(wide_wdi$`Country Name`)## [1] "China" "India" "Indonesia"
## [4] "Jordan" "Pakistan" "Russian Federation"
## [7] "United States"
unique(pew$country)## [1] "China" "India" "Indonesia" "Jordan" "Pakistan" "Russia"
## [7] "US"
We have a bit of a disparity in how the USA and Russia are coded. Let’s fix that in pew so that it matches wide_wdi.
pew$country[which(pew$country == "US")] <- "United States"
pew$country[which(pew$country == "Russia")] <- "Russian Federation"We can merge on just the overlap of our data sets (that’s the default method). Because our “hinge” has different names in the two datasets, we’ll specify an “x” and a “y” data frame and indicate the name of the “hinge” in each one.
merged <- merge(x=wide_wdi,
y=pew,
by.x = "Country Name",
by.y = "country")Let’s look at the merged data set:
kable(head(merged))| Country Name | Country Code | Expenditure | Income | satis | age | count |
|---|---|---|---|---|---|---|
| China | CHN | 15.590028 | 1135.4479 | 5.174480 | 41.26559 | 2839 |
| India | IND | 11.890161 | 486.6405 | 5.085851 | 37.00260 | 1922 |
| Indonesia | IDN | 7.257458 | 909.8873 | 5.761539 | 37.32747 | 910 |
| Jordan | JOR | 22.695573 | 1901.5804 | 5.343716 | 36.84227 | 932 |
| Pakistan | PAK | 8.723921 | 483.0319 | 5.472683 | 36.19854 | 1642 |
| Russian Federation | RUS | 17.953634 | 2375.1579 | 4.741266 | 45.23326 | 926 |