Midlife Crisis, R Version

What the Heck is This?

This is my attempt to get started (not fully reproduce) the data acquisition, cleaning, and analysis of the data in the “midlife crisis” project presented during the TIER workshop on Nov. 2-3, 2018.

Get and Clean Data

WDI Dataset

The original World Development Indicators dataset is in an Excel file. I use readxl to read this data in, indicating that the data is in the first sheet.

library(readxl)
wdi <- read_xlsx("Original-Data/original-wdi.xlsx",1)

Let’s take a quick peek at wdi. We could just do this using wdi, but to get a “pretty print” version, we’re going to use kable, which comes from the knitr package.

library(knitr)
kable(wdi)

Country Name	Country Code	Series Name	Series Code	2002 [YR2002]
India	IND	GDP per capita (current US$)	NY.GDP.PCAP.CD	486.640454
India	IND	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	11.890161
Indonesia	IDN	GDP per capita (current US$)	NY.GDP.PCAP.CD	909.887330
Indonesia	IDN	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	7.257458
Russian Federation	RUS	GDP per capita (current US$)	NY.GDP.PCAP.CD	2375.157871
Russian Federation	RUS	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	17.953634
Pakistan	PAK	GDP per capita (current US$)	NY.GDP.PCAP.CD	483.031872
Pakistan	PAK	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	8.723921
China	CHN	GDP per capita (current US$)	NY.GDP.PCAP.CD	1135.447950
China	CHN	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	15.590028
Jordan	JOR	GDP per capita (current US$)	NY.GDP.PCAP.CD	1901.580379
Jordan	JOR	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	22.695573
United States	USA	GDP per capita (current US$)	NY.GDP.PCAP.CD	38175.376383
United States	USA	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	15.039799
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
NA	NA	NA	NA	NA
Data from database: WDI	NA	NA	NA	NA
Last Updated: 05/06/2014	NA	NA	NA	NA

Removing Rows

Whoops, we brought in some extraneous data we don’t care about (rows 15-19). Let’s remove those and look again:

wdi <- wdi[-c(15:19),]
kable(wdi)

Country Name	Country Code	Series Name	Series Code	2002 [YR2002]
India	IND	GDP per capita (current US$)	NY.GDP.PCAP.CD	486.640454
India	IND	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	11.890161
Indonesia	IDN	GDP per capita (current US$)	NY.GDP.PCAP.CD	909.887330
Indonesia	IDN	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	7.257458
Russian Federation	RUS	GDP per capita (current US$)	NY.GDP.PCAP.CD	2375.157871
Russian Federation	RUS	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	17.953634
Pakistan	PAK	GDP per capita (current US$)	NY.GDP.PCAP.CD	483.031872
Pakistan	PAK	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	8.723921
China	CHN	GDP per capita (current US$)	NY.GDP.PCAP.CD	1135.447950
China	CHN	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	15.590028
Jordan	JOR	GDP per capita (current US$)	NY.GDP.PCAP.CD	1901.580379
Jordan	JOR	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	22.695573
United States	USA	GDP per capita (current US$)	NY.GDP.PCAP.CD	38175.376383
United States	USA	General government final consumption expenditure (% of GDP)	NE.CON.GOVT.ZS	15.039799

That’s better!

Transforming Long to Wide

Also, this is interesting – we have “long”, or key-value data (each observation, or row, represents a combination of country and measure – or key –, and there’s one value associated with each row, in the “2002 [YR2002]” column).

But we want it to be “wide” (each observation is a country, and the columns are the various measures, with the value in the intersection).

The best way to handle this is with tidyr, which has some great reshaping commands.

We’ll use the tidyr command spread(). We’ll also, in the same command, remove “Series Name”, since we really only need the “Series Code” field – having both is a bit extraneous. We’re going to use the country code and name later, so we’ll leave both of those fields untouched.

library(tidyr)
library(dplyr)
wide_wdi <- spread(data = wdi %>% select(-`Series Name`), 
                   key = "Series Code",
                   value = "2002 [YR2002]")

Let’s take a peek at wide_wdi:

kable(wide_wdi)

Country Name	Country Code	NE.CON.GOVT.ZS	NY.GDP.PCAP.CD
China	CHN	15.590028	1135.4479
India	IND	11.890161	486.6405
Indonesia	IDN	7.257458	909.8873
Jordan	JOR	22.695573	1901.5804
Pakistan	PAK	8.723921	483.0319
Russian Federation	RUS	17.953634	2375.1579
United States	USA	15.039799	38175.3764

Renaming Variables

Our variables are named in kind of a clunky way. Let’s fix that!

names(wide_wdi)[3:4] <- c("Expenditure", "Income")

And peek again!

kable(wide_wdi)

Country Name	Country Code	Expenditure	Income
China	CHN	15.590028	1135.4479
India	IND	11.890161	486.6405
Indonesia	IDN	7.257458	909.8873
Jordan	JOR	22.695573	1901.5804
Pakistan	PAK	8.723921	483.0319
Russian Federation	RUS	17.953634	2375.1579
United States	USA	15.039799	38175.3764

Pew Data

The Pew data was saved as an SPSS .sav file format. We can read this via the foreign package.

library(foreign)
pew = read.spss("Original-Data/original-pew.sav", to.data.frame=TRUE)

The pew data is quite large! Let’s take a peek at the variables it includes. Since there are quite a few, I’ll just use head(), which shows the first handful of values. The default is 6, but I want a few more, so I specify 20.

head(names(pew), 20)

##  [1] "country"  "psraid"   "quest_id" "q1"       "q2"       "q3"      
##  [7] "q4"       "q5.1rec"  "q5.2rec"  "q5.3rec"  "q6a"      "q6b"     
## [13] "q6c"      "q7"       "q8.1rec"  "q8.2rec"  "q8.3rec"  "q9"      
## [19] "q10"      "q10chi"

Removing Columns

In the Pew data, we really only need three variables:

country
q2, which is a self-report of life satisfaction
q74, which is age

We’ll rename the two question variables to make them more intuitive.

The dplyr function transmute is perfect for creating a new, simplified dataset from an existing dataset.

library(dplyr)
pew <- pew %>% 
       transmute(country, satis = q2, age = q74)

We may still have some data quality issues. Let’s take a peek at the top of pew and also look specifically at the factor levels of satis and age.

head(pew)

##     country satis age
## 1 Argentina  Four  37
## 2 Argentina Three  64
## 3 Argentina  Nine  25
## 4 Argentina Eight  40
## 5 Argentina  Four  48
## 6 Argentina Eight  86

Ugh! The numbers are written out? This is an artifact of how we brought the data in from SPSS – it gave us labels. It’s handy for country names, but not for things like age.

We’ll have to fix that. Let’s look at the levels to see what’s up:

Working With Levels

levels(pew$satis)

##  [1] "Zero"                     "One"                     
##  [3] "Two"                      "Three"                   
##  [5] "Four"                     "Five"                    
##  [7] "Six"                      "Seven"                   
##  [9] "Eight"                    "Nine"                    
## [11] "Ten"                      "Don't know (DO NOT READ)"
## [13] "Refused (DO NOT READ)"

Notice that the level number for “Zero” is 1, and the level number for “One” is two, and so on. This will prove useful later.

levels(pew$age)

##  [1] "18"                       "19"                      
##  [3] "20"                       "21"                      
##  [5] "22"                       "23"                      
##  [7] "24"                       "25"                      
##  [9] "26"                       "27"                      
## [11] "28"                       "29"                      
## [13] "30"                       "31"                      
## [15] "32"                       "33"                      
## [17] "34"                       "35"                      
## [19] "36"                       "37"                      
## [21] "38"                       "39"                      
## [23] "40"                       "41"                      
## [25] "42"                       "43"                      
## [27] "44"                       "45"                      
## [29] "46"                       "47"                      
## [31] "48"                       "49"                      
## [33] "50"                       "51"                      
## [35] "52"                       "53"                      
## [37] "54"                       "55"                      
## [39] "56"                       "57"                      
## [41] "58"                       "59"                      
## [43] "60"                       "61"                      
## [45] "62"                       "63"                      
## [47] "64"                       "65"                      
## [49] "66"                       "67"                      
## [51] "68"                       "69"                      
## [53] "70"                       "71"                      
## [55] "72"                       "73"                      
## [57] "74"                       "75"                      
## [59] "76"                       "77"                      
## [61] "78"                       "79"                      
## [63] "80"                       "81"                      
## [65] "82"                       "83"                      
## [67] "84"                       "85"                      
## [69] "86"                       "87"                      
## [71] "88"                       "89"                      
## [73] "90"                       "91"                      
## [75] "92"                       "93"                      
## [77] "94"                       "95"                      
## [79] "97 or older"              "Don't know (DO NOT READ)"
## [81] "Refused (DO NOT READ)"

Replacing Values

OK, so the first thing we can do is get out the non-responders and imprecise age, and give them an “NA”.

pew$satis[which(pew$satis == "Don't know (DO NOT READ)")] <- NA
pew$satis[which(pew$satis == "Refused (DO NOT READ)")] <- NA

pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "Refused (DO NOT READ)")] <- NA
pew$age[which(pew$age == "97 or older")] <- NA

For satis, we noticed that the level numbers were one more than the actual labels, so we can easily convert written out labels to numbers:

pew$satis <- as.numeric(pew$satis)-1

For age, we have to get the labels themselves, NOT the level numbers. This is confusing! The labels are character, not numeric, so, for example, “18”, not 18.

We have to do a double transformation: first, get the labels themselves by using as.character(), then convert those character data to numeric using as.numeric():

pew$age <- as.numeric(as.character(pew$age))

Finally, the country variable in pew is a factor variable, and we want it to be a character string (in order to merge easily). That’s an easy fix:

pew$country <- as.character(pew$country)

Let’s peek at the cleaned up Pew data:

head(pew)

##     country satis age
## 1 Argentina     4  37
## 2 Argentina     3  64
## 3 Argentina     9  25
## 4 Argentina     8  40
## 5 Argentina     4  48
## 6 Argentina     8  86

We’re going to remove the youngest and oldest folks (not sure of the scientific reasoning for this):

pew <- pew %>% filter(age >= 21 & age <= 70 )

Aggregate Data

We want to aggregate country data (squish down all of a given country’s data into a single row). dplyr has a function, summarise (or summarize) which does this nicely.

pew <- pew %>% 
       group_by(country) %>%
       summarise(satis = mean(satis, na.rm = TRUE), 
                 age=mean(age, na.rm = TRUE),
                 count = n())

What does pew look like now?

kable(pew)

country	satis	age	count
Angola	3.946735	31.04034	595
Argentina	5.906205	41.83764	696
Bangladesh	5.231250	35.63876	645
Bolivia	5.448737	36.05465	677
Brazil	6.099282	40.03444	842
Bulgaria	3.797066	45.48434	415
Canada	6.945498	44.69104	424
China	5.174480	41.26559	2839
Czech Republic	5.909739	44.63420	421
Egypt	6.369645	35.20510	824
France	6.513189	44.21531	418
Germany	6.377886	45.82367	828
Ghana	4.467890	36.90305	557
Great Britain	6.401392	44.61663	433
Guatemala	7.324582	38.09739	421
Honduras	6.839449	36.96339	437
India	5.085851	37.00260	1922
Indonesia	5.761539	37.32747	910
Italy	6.381899	42.71082	453
Ivory Coast	5.315790	30.75212	589
Japan	5.948854	50.98589	567
Jordan	5.343716	36.84227	932
Kenya	4.958549	37.34364	582
Lebanon	4.924331	38.06279	860
Mali	4.461279	37.67395	595
Mexico	6.781818	39.63464	895
Nigeria	5.869159	36.67749	862
Pakistan	5.472683	36.19854	1642
Peru	5.682927	38.25563	622
Philippines	5.596774	39.88818	626
Poland	5.374384	43.78846	416
Russia	4.741266	45.23326	926
Senegal	5.470016	35.81524	617
Slovak Republic	5.392523	42.58178	428
South Africa	5.160804	37.76047	597
South Korea	6.443251	39.94018	652
Tanzania	3.907051	39.84494	632
Turkey	4.559382	38.31271	873
Uganda	3.987194	35.32908	863
Ukraine	4.808126	43.73138	443
US	7.019277	44.18283	1258
Uzbekistan	5.850921	37.50334	598
Venezuela	6.292308	38.56899	587
Vietnam	6.375527	38.03376	711

We want to remove countries with a small number of measurements.

pew <- pew %>% filter(count >= 900)

Data Harmonization

Let’s make sure our country names match (so we can merge effectively):

unique(wide_wdi$`Country Name`)

## [1] "China"              "India"              "Indonesia"         
## [4] "Jordan"             "Pakistan"           "Russian Federation"
## [7] "United States"

unique(pew$country)

## [1] "China"     "India"     "Indonesia" "Jordan"    "Pakistan"  "Russia"   
## [7] "US"

We have a bit of a disparity in how the USA and Russia are coded. Let’s fix that in pew so that it matches wide_wdi.

pew$country[which(pew$country == "US")] <- "United States"
pew$country[which(pew$country == "Russia")] <- "Russian Federation"

Merge Data

We can merge on just the overlap of our data sets (that’s the default method). Because our “hinge” has different names in the two datasets, we’ll specify an “x” and a “y” data frame and indicate the name of the “hinge” in each one.

merged <- merge(x=wide_wdi, 
                y=pew, 
                by.x = "Country Name",
                by.y = "country")

Let’s look at the merged data set:

kable(head(merged))

Country Name	Country Code	Expenditure	Income	satis	age	count
China	CHN	15.590028	1135.4479	5.174480	41.26559	2839
India	IND	11.890161	486.6405	5.085851	37.00260	1922
Indonesia	IDN	7.257458	909.8873	5.761539	37.32747	910
Jordan	JOR	22.695573	1901.5804	5.343716	36.84227	932
Pakistan	PAK	8.723921	483.0319	5.472683	36.19854	1642
Russian Federation	RUS	17.953634	2375.1579	4.741266	45.23326	926