8 June 2016

Today's agenda

There are lots of interesting questions that we can answer using freely available data. We're going to watch a video that suggests one such question. We'll then develop an R script that builds a figure from the relevant data.

To do this, we're going to have to

  • formulate a question
  • figure out what data we need
  • get the data
  • put the data into a form we can analyze
  • read the data into R and perform some analyses
  • plot up the data
  • derive insights from the plot

A video featuring Swedish physician and statistician Hans Rosling

What is the world distribution of per capita income, in $/day?

How many people in the world live on

  • $3/day or less? (~$1/day)
  • between $3/day and $30/day? (~$10/day)
  • between $30/day and $300/day? (~$100/day)

What kind of data do we need to answer this question?

  • something to do with income
  • some measure of population

Most data are aggregated by country. One measure of income is gross domestic product (GDP). So, we'll use

  • population by country
  • GDP per capita by country

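Using these data to answer our question has some obvious problems (for one, GDP per capita is a national average that says nothing about how income is spread within a country), but we'll ignore them for now.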

Where to find the data

Check out http://data.worldbank.org/ and search for

  • GDP per capita (choose "Current US dollars")
  • Population (choose "total")

Download the data as .csv files to a place where you can find them.

Wrangling the data

  • Unzip the data files
  • Open up the first .csv file in each unzipped folder
  • Inspect the files – what do they tell us about their contents?
  • In each open file, delete all the columns except for the country names and the data for 2005
  • Make a new Excel workbook and paste the remaining information from each data file into the new workbook
  • Check the country names – do they line up?
  • Delete the extra country names column
  • Save the new workbook, also as a .csv file
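
The same wrangling can also be done in R rather than by hand in a spreadsheet. The following is a minimal sketch, not the procedure used above. It assumes the World Bank files keep their usual layout: four lines of metadata above the header, a "Country Name" column, and one column per year, which read.csv renames to Country.Name and X2005. The helper name combine_2005 and the file names in the comments are made up for illustration, and the toy data frames with invented numbers stand in for the downloaded files.

```r
# Hypothetical helper: keep the country name and the 2005 column from
# each table, then join the two tables on the country name.
combine_2005 <- function(gdp, pop) {
  gdp.2005 <- gdp[, c("Country.Name", "X2005")]
  pop.2005 <- pop[, c("Country.Name", "X2005")]
  names(gdp.2005) <- c("Country_Name", "GDP_person")
  names(pop.2005) <- c("Country_Name", "Population")
  # merge() matches rows by name, so the two files do not
  # need to list the countries in the same order.
  merge(gdp.2005, pop.2005, by = "Country_Name")
}

# With the real downloads this would look something like
# (file names are assumptions):
#   gdp <- read.csv("gdp_per_capita.csv", skip = 4, header = TRUE)
#   pop <- read.csv("population.csv", skip = 4, header = TRUE)
# Toy stand-ins with invented numbers:
gdp <- data.frame(Country.Name = c("A", "B", "C"),
                  X2004 = c(900, 9000, 30000),
                  X2005 = c(1000, 10000, 32000))
pop <- data.frame(Country.Name = c("C", "A", "B"),
                  X2004 = c(4.9e6, 1.9e7, 2.9e8),
                  X2005 = c(5e6, 2e7, 3e8))
combined <- combine_2005(gdp, pop)
print(combined)
```

Because merge() joins on the country name, the "do they line up?" check is handled by the join itself. A table saved from this sketch with write.csv(combined, ..., row.names = FALSE) has no metadata lines, so it would be read back without skip = 4.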

Piece 1

# Clean up the workspace.  
rm(list = ls())
graphics.off()

# Go get some necessary R packages.  
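# The package names must be passed as one
# character vector; a second positional
# argument would be read as the library path.  
install.packages(c("magicaxis", "dplyr"))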
library(magicaxis)
library(dplyr)

Piece 2

# Read in the cleaned-up data, set
# the column names to be something
# sensible, and look at the data.  
country.data <- read.csv("combined_gdp_per_capita_population.csv", 
                         skip = 4, header = TRUE)
names(country.data) <- c("Country_Name", "GDP_person", "Population")
print(head(country.data))

Piece 3

# Convert Population to millions of people and
# GDP_person to dollars per day.  
country.data$Population <- country.data$Population / 10^6
country.data$GDP_person <- country.data$GDP_person / 365

# Make a histogram of GDP/person by
# country.  Not all that useful.  
hist(country.data$GDP_person, main = "", 
     xlab = "GDP/person", ylab = "Number of countries")
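
One likely reason the plain histogram is not very informative is that GDP/person spans several orders of magnitude, so most countries pile up in the lowest bins. A possible alternative, sketched here on invented numbers (the vector gdp.per.day is not the real data), is to histogram the base-10 logarithm instead.

```r
# Invented GDP/person values in $/day, spanning several orders of magnitude.
gdp.per.day <- c(0.8, 1.5, 2.2, 4, 7, 11, 19, 35, 60, 95, 140)

# Histogram of log10(GDP/person): each unit on the x-axis is a factor of 10.
log.hist <- hist(log10(gdp.per.day), breaks = seq(-1, 3, by = 0.5),
                 main = "", xlab = "log10 of GDP/person ($/day)",
                 ylab = "Number of countries")
print(log.hist$counts)
```

With the real data, country.data$GDP_person would take the place of gdp.per.day, and missing values would need to be dropped first.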

Our histogram

Piece 4

# Sort the data according to the GDP/person
# column and look at the data again.  
sort.data <- arrange(country.data, GDP_person)
print(head(sort.data))

# Create a new column that contains cumulative
# population and look at the updated data.  
sort.data$Cumulative_population <- cumsum(sort.data$Population)
print(head(sort.data))
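
One thing to watch for in this step: cumsum propagates missing values, so a single NA in the population column makes every later cumulative total NA as well. World Bank tables often have gaps for some countries and years, so it may be worth dropping incomplete rows before sorting. A small sketch on invented numbers (the data frame toy is not the real data):

```r
toy <- data.frame(Country_Name = c("A", "B", "C", "D"),
                  GDP_person   = c(2.5, NA, 40, 12),
                  Population   = c(20, 5, NA, 300))

# Without cleaning, the NA spreads to every later entry.
print(cumsum(toy$Population))

# Keep only rows where both GDP_person and Population are present.
toy.clean <- toy[complete.cases(toy[, c("GDP_person", "Population")]), ]

# Sort by GDP/person, then accumulate the population.
toy.clean <- toy.clean[order(toy.clean$GDP_person), ]
toy.clean$Cumulative_population <- cumsum(toy.clean$Population)
print(toy.clean)
```

The sketch uses base R's order() rather than dplyr's arrange() only so that it runs without extra packages.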

Piece 5

# And plot up the results.  We use magplot
# instead of plot so that we can have pretty
# logarithmic axes.  
magplot(sort.data$GDP_person, sort.data$Cumulative_population, 
        type = "s", log = "x", 
        main = "World income distribution", 
        xlab = "GDP/person ($/day)", 
        ylab = "Cumulative population (millions of people)")
points(sort.data$GDP_person, sort.data$Cumulative_population)

And the result

Back to the question

What is the world distribution of per capita income, in $/day?

How many people in the world live on

  • $3/day or less? (~$1/day)
  • between $3/day and $30/day? (~$10/day)
  • between $30/day and $300/day? (~$100/day)
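
Besides reading the answer off the plot, the bracket totals can be computed directly. This sketch uses cut to assign each country to an income bracket and tapply to add up the population in each one. The data frame toy holds invented numbers in the same layout as sort.data (GDP/person in $/day, population in millions); with the real data, sort.data would take its place. It is a simplification: everyone in a country is counted at that country's average GDP/person.

```r
toy <- data.frame(Country_Name = c("A", "B", "C", "D", "E"),
                  GDP_person   = c(1.2, 2.8, 9.5, 25, 110),
                  Population   = c(80, 150, 1300, 200, 300))

# Brackets: (0, 3], (3, 30], (30, 300] dollars per day.
bracket <- cut(toy$GDP_person, breaks = c(0, 3, 30, 300),
               labels = c("<= $3/day", "$3-$30/day", "$30-$300/day"))

# Total population (millions of people) in each bracket.
bracket.pop <- tapply(toy$Population, bracket, sum)
print(bracket.pop)
```

Countries with GDP/person above $300/day would fall outside these breaks and be assigned NA, so the last break would need to be raised if any such rows exist.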