What the Heck is This?

This is my attempt to get started (not fully reproduce) the data acquisition, cleaning, and analysis of the data in the “midlife crisis” project presented during the TIER workshop on Nov. 2-3, 2018.

Get and Clean Data

WDI Dataset

The original World Development Indicators dataset is in an Excel file. I use readxl to read this data in, indicating that the data is in the first sheet.

library(readxl)
wdi <- read_xlsx("Original-Data/original-wdi.xlsx",1)

Let’s take a quick peek at wdi. We could just do this using wdi, but to get a “pretty print” version, we’re going to use kable, which comes from the knitr package.

library(knitr)
kable(wdi)
Country Name Country Code Series Name Series Code 2002 [YR2002]
India IND GDP per capita (current US$) NY.GDP.PCAP.CD 486.640454
India IND General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 11.890161
Indonesia IDN GDP per capita (current US$) NY.GDP.PCAP.CD 909.887330
Indonesia IDN General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 7.257458
Russian Federation RUS GDP per capita (current US$) NY.GDP.PCAP.CD 2375.157871
Russian Federation RUS General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 17.953634
Pakistan PAK GDP per capita (current US$) NY.GDP.PCAP.CD 483.031872
Pakistan PAK General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 8.723921
China CHN GDP per capita (current US$) NY.GDP.PCAP.CD 1135.447950
China CHN General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 15.590028
Jordan JOR GDP per capita (current US$) NY.GDP.PCAP.CD 1901.580379
Jordan JOR General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 22.695573
United States USA GDP per capita (current US$) NY.GDP.PCAP.CD 38175.376383
United States USA General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 15.039799
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
Data from database: WDI NA NA NA NA
Last Updated: 05/06/2014 NA NA NA NA

Removing Rows

Whoops, we brought in some extraneous data we don’t care about (rows 15-19). Let’s remove those and look again:

wdi <- wdi[-c(15:19),]
kable(wdi)
Country Name Country Code Series Name Series Code 2002 [YR2002]
India IND GDP per capita (current US$) NY.GDP.PCAP.CD 486.640454
India IND General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 11.890161
Indonesia IDN GDP per capita (current US$) NY.GDP.PCAP.CD 909.887330
Indonesia IDN General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 7.257458
Russian Federation RUS GDP per capita (current US$) NY.GDP.PCAP.CD 2375.157871
Russian Federation RUS General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 17.953634
Pakistan PAK GDP per capita (current US$) NY.GDP.PCAP.CD 483.031872
Pakistan PAK General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 8.723921
China CHN GDP per capita (current US$) NY.GDP.PCAP.CD 1135.447950
China CHN General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 15.590028
Jordan JOR GDP per capita (current US$) NY.GDP.PCAP.CD 1901.580379
Jordan JOR General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 22.695573
United States USA GDP per capita (current US$) NY.GDP.PCAP.CD 38175.376383
United States USA General government final consumption expenditure (% of GDP) NE.CON.GOVT.ZS 15.039799

That’s better!

Transforming Long to Wide

Also, this is interesting – we have “long”, or key-value data (each observation, or row, represents a combination of country and measure – or key –, and there’s one value associated with each row, in the “2002 [YR2002]” column).

But we want it to be “wide” (each observation is a country, and the columns are the various measures, with the value in the intersection).

The best way to handle this is with tidyr, which has some great reshaping commands.

We’ll use the tidyr command spread(). We’ll also, in the same command, remove “Series Name”, since we really only need the “Series Code” field – having both is a bit extraneous. We’re going to use the country code and name later, so we’ll leave both of those fields untouched.

library(tidyr)
library(dplyr)
wide_wdi <- spread(data = wdi %>% select(-`Series Name`), 
                   key = "Series Code",
                   value = "2002 [YR2002]")

Let’s take a peek at wide_wdi:

kable(wide_wdi)
Country Name Country Code NE.CON.GOVT.ZS NY.GDP.PCAP.CD
China CHN 15.590028 1135.4479
India IND 11.890161 486.6405
Indonesia IDN 7.257458 909.8873
Jordan JOR 22.695573 1901.5804
Pakistan PAK 8.723921 483.0319
Russian Federation RUS 17.953634 2375.1579
United States USA 15.039799 38175.3764

Renaming Variables

Our variables are named in kind of a clunky way. Let’s fix that!

names(wide_wdi)[3:4] <- c("Expenditure", "Income")

And peek again!

kable(wide_wdi)
Country Name Country Code Expenditure Income
China CHN 15.590028 1135.4479
India IND 11.890161 486.6405
Indonesia IDN 7.257458 909.8873
Jordan JOR 22.695573 1901.5804
Pakistan PAK 8.723921 483.0319
Russian Federation RUS 17.953634 2375.1579
United States USA 15.039799 38175.3764

Pew Data

The Pew data was saved as an SPSS .sav file format. We can read this via the foreign package.

library(foreign)
pew = read.spss("Original-Data/original-pew.sav", to.data.frame=TRUE)

The pew data is quite large! Let’s take a peek at the variables it includes. Since there are quite a few, I’ll just use head(), which shows the first handful of values. The default is 6, but I want a few more, so I specify 20.

head(names(pew), 20)
##  [1] "country"  "psraid"   "quest_id" "q1"       "q2"       "q3"      
##  [7] "q4"       "q5.1rec"  "q5.2rec"  "q5.3rec"  "q6a"      "q6b"     
## [13] "q6c"      "q7"       "q8.1rec"  "q8.2rec"  "q8.3rec"  "q9"      
## [19] "q10"      "q10chi"

Removing Columns

In the Pew data, we really only need three variables:

  • country
  • q2, which is a self-report of life satisfaction
  • q74, which is age

We’ll rename the two question variables to make them more intuitive.

The dplyr function transmute is perfect for creating a new, simplified dataset from an existing dataset.

library(dplyr)
pew <- pew %>% 
       transmute(country, satis = q2, age = q74)

We may still have some data quality issues. Let’s take a peek at the top of pew and also look specifically at the factor levels of satis and age.

head(pew)
##     country satis age
## 1 Argentina  Four  37
## 2 Argentina Three  64
## 3 Argentina  Nine  25
## 4 Argentina Eight  40
## 5 Argentina  Four  48
## 6 Argentina Eight  86

Ugh! The numbers are written out? This is an artifact of how we brought the data in from SPSS – it gave us labels. It’s handy for country names, but not for things like age.

We’ll have to fix that. Let’s look at the levels to see what’s up:

Working With Levels

levels(pew$satis)
##  [1] "Zero"                     "One"                     
##  [3] "Two"                      "Three"                   
##  [5] "Four"                     "Five"                    
##  [7] "Six"                      "Seven"                   
##  [9] "Eight"                    "Nine"                    
## [11] "Ten"                      "Don't know (DO NOT READ)"
## [13] "Refused (DO NOT READ)"

Notice that the level number for “Zero” is 1, and the level number for “One” is two, and so on. This will prove useful later.

levels(pew$age)
##  [1] "18"                       "19"                      
##  [3] "20"                       "21"                      
##  [5] "22"                       "23"                      
##  [7] "24"                       "25"                      
##  [9] "26"                       "27"                      
## [11] "28"                       "29"                      
## [13] "30"                       "31"                      
## [15] "32"                       "33"                      
## [17] "34"                       "35"                      
## [19] "36"                       "37"                      
## [21] "38"                       "39"                      
## [23] "40"                       "41"                      
## [25] "42"                       "43"                      
## [27] "44"                       "45"                      
## [29] "46"                       "47"                      
## [31] "48"                       "49"                      
## [33] "50"                       "51"                      
## [35] "52"                       "53"                      
## [37] "54"                       "55"                      
## [39] "56"                       "57"                      
## [41] "58"                       "59"                      
## [43] "60"                       "61"                      
## [45] "62"                       "63"                      
## [47] "64"                       "65"                      
## [49] "66"                       "67"                      
## [51] "68"                       "69"                      
## [53] "70"                       "71"                      
## [55] "72"                       "73"                      
## [57] "74"                       "75"                      
## [59] "76"                       "77"                      
## [61] "78"                       "79"                      
## [63] "80"                       "81"                      
## [65] "82"                       "83"                      
## [67] "84"                       "85"                      
## [69] "86"                       "87"                      
## [71] "88"                       "89"                      
## [73] "90"                       "91"                      
## [75] "92"                       "93"                      
## [77] "94"                       "95"                      
## [79] "97 or older"              "Don't know (DO NOT READ)"
## [81] "Refused (DO NOT READ)"

Replacing Values

OK, so the first thing we can do is get out the non-responders and imprecise age, and give them an “NA”.

pew$satis[which(pew$satis == "Don't know (DO NOT READ)")] <- NA
pew$satis[which(pew$satis == "Refused (DO NOT READ)")] <- NA

pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "97 or older")] <- NA

For satis, we noticed that the level numbers were one more than the actual labels, so we can easily convert written out labels to numbers:

pew$satis <- as.numeric(pew$satis)-1

For age, we have to get the labels themselves, NOT the level numbers. This is confusing! The labels are character, not numeric, so, for example, “18”, not 18.

We have to do a double transformation: first, get the labels themselves by using as.character(), then convert those character data to numeric using as.numeric():

pew$age <- as.numeric(as.character(pew$age))

Finally, the country variable in pew is a factor variable, and we want it to be a character string (in order to merge easily). That’s an easy fix:

pew$country <- as.character(pew$country)

Let’s peek at the cleaned up Pew data:

head(pew)
##     country satis age
## 1 Argentina     4  37
## 2 Argentina     3  64
## 3 Argentina     9  25
## 4 Argentina     8  40
## 5 Argentina     4  48
## 6 Argentina     8  86

We’re going to remove the youngest and oldest folks (not sure of the scientific reasoning for this):

pew <- pew %>% filter(age >= 21 & age <= 70 )

Aggregate Data

We want to aggregate country data (squish down all of a given country’s data into a single row). dplyr has a function, summarise (or summarize) which does this nicely.

pew <- pew %>% 
       group_by(country) %>%
       summarise(satis = mean(satis, na.rm = TRUE), 
                 age=mean(age, na.rm = TRUE),
                 count = n())

What does pew look like now?

kable(pew)
country satis age count
Angola 3.946735 31.04034 595
Argentina 5.906205 41.83764 696
Bangladesh 5.231250 35.63876 645
Bolivia 5.448737 36.05465 677
Brazil 6.099282 40.03444 842
Bulgaria 3.797066 45.48434 415
Canada 6.945498 44.69104 424
China 5.174480 41.26559 2839
Czech Republic 5.909739 44.63420 421
Egypt 6.369645 35.20510 824
France 6.513189 44.21531 418
Germany 6.377886 45.82367 828
Ghana 4.467890 36.90305 557
Great Britain 6.401392 44.61663 433
Guatemala 7.324582 38.09739 421
Honduras 6.839449 36.96339 437
India 5.085851 37.00260 1922
Indonesia 5.761539 37.32747 910
Italy 6.381899 42.71082 453
Ivory Coast 5.315790 30.75212 589
Japan 5.948854 50.98589 567
Jordan 5.343716 36.84227 932
Kenya 4.958549 37.34364 582
Lebanon 4.924331 38.06279 860
Mali 4.461279 37.67395 595
Mexico 6.781818 39.63464 895
Nigeria 5.869159 36.67749 862
Pakistan 5.472683 36.19854 1642
Peru 5.682927 38.25563 622
Philippines 5.596774 39.88818 626
Poland 5.374384 43.78846 416
Russia 4.741266 45.23326 926
Senegal 5.470016 35.81524 617
Slovak Republic 5.392523 42.58178 428
South Africa 5.160804 37.76047 597
South Korea 6.443251 39.94018 652
Tanzania 3.907051 39.84494 632
Turkey 4.559382 38.31271 873
Uganda 3.987194 35.32908 863
Ukraine 4.808126 43.73138 443
US 7.019277 44.18283 1258
Uzbekistan 5.850921 37.50334 598
Venezuela 6.292308 38.56899 587
Vietnam 6.375527 38.03376 711

We want to remove countries with a small number of measurements.

pew <- pew %>% filter(count >= 900)

Data Harmonization

Let’s make sure our country names match (so we can merge effectively):

unique(wide_wdi$`Country Name`)
## [1] "China"              "India"              "Indonesia"         
## [4] "Jordan"             "Pakistan"           "Russian Federation"
## [7] "United States"
unique(pew$country)
## [1] "China"     "India"     "Indonesia" "Jordan"    "Pakistan"  "Russia"   
## [7] "US"

We have a bit of a disparity in how the USA and Russia are coded. Let’s fix that in pew so that it matches wide_wdi.

pew$country[which(pew$country == "US")] <- "United States"
pew$country[which(pew$country == "Russia")] <- "Russian Federation"

Merge Data

We can merge on just the overlap of our data sets (that’s the default method). Because our “hinge” has different names in the two datasets, we’ll specify an “x” and a “y” data frame and indicate the name of the “hinge” in each one.

merged <- merge(x=wide_wdi, 
                y=pew, 
                by.x = "Country Name",
                by.y = "country")

Let’s look at the merged data set:

kable(head(merged))
Country Name Country Code Expenditure Income satis age count
China CHN 15.590028 1135.4479 5.174480 41.26559 2839
India IND 11.890161 486.6405 5.085851 37.00260 1922
Indonesia IDN 7.257458 909.8873 5.761539 37.32747 910
Jordan JOR 22.695573 1901.5804 5.343716 36.84227 932
Pakistan PAK 8.723921 483.0319 5.472683 36.19854 1642
Russian Federation RUS 17.953634 2375.1579 4.741266 45.23326 926