Once you’ve started learning tools for data manipulation and visualization like dplyr and ggplot2, this course gives you a chance to use them in action on a real dataset. You’ll explore the historical voting of the United Nations General Assembly, including analyzing differences in voting between countries, across time, and among international issues. In the process you’ll gain more practice with the dplyr and ggplot2 packages, learn about the broom package for tidying model output, and experience the kind of start-to-finish exploratory analysis common in data science.

1: Data cleaning and summarizing with dplyr

The best way to learn data wrangling skills is to apply them to a specific case study. Here you’ll learn how to clean and filter the United Nations voting dataset using the dplyr package, and how to summarize it into smaller, interpretable units.

library(dplyr)

# Download the raw voting data once, then read it from the local copy
url <- "https://github.com/datasciencelabs/data/raw/master/rawvotingdata13.tab"
filename <- basename(url)
if (!file.exists(filename)) download.file(url, destfile = filename, mode = "wb")
votes <- read.delim(filename, header = TRUE, sep = "\t", quote = "")

# Exclude session 19
votes <- votes %>%
  filter(session != 19)

Video: The United Nations Voting Dataset

str(votes)
## 'data.frame':    1051555 obs. of  4 variables:
##  $ rcid   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ session: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ vote   : num  1 3 9 1 1 1 9 9 9 9 ...
##  $ ccode  : num  2 20 31 40 41 42 51 52 53 54 ...
head(votes)
##   rcid session vote ccode
## 1    3       1    1     2
## 2    3       1    3    20
## 3    3       1    9    31
## 4    3       1    1    40
## 5    3       1    1    41
## 6    3       1    1    42
votes <- votes %>% 
  mutate(year = session + 1945)

Filtering rows

The vote column in the dataset has a number that represents that country’s vote:

  • 1 = Yes
  • 2 = Abstain
  • 3 = No
  • 8 = Not present
  • 9 = Not a member

One step of data cleaning is removing observations (rows) that you’re not interested in. In this case, you want to remove “Not present” and “Not a member”.

unique(votes$vote)
## [1] 1 3 9 2 8
# Filter for votes that are "yes", "abstain", or "no"
votes %>% 
  filter(vote <= 3)
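
As a minimal sketch of what this filter does, the toy data frame below (invented for illustration, not taken from the UN dataset) holds one row per vote code. Keeping the rows where vote <= 3 drops codes 8 and 9.

```r
library(dplyr)

# Toy data, invented for illustration: one row per vote code
toy_votes <- data.frame(
  rcid  = c(3, 3, 3, 3, 3),
  vote  = c(1, 2, 3, 8, 9),
  ccode = c(2, 20, 31, 40, 41)
)

# Keep only "yes" (1), "abstain" (2), and "no" (3)
toy_kept <- toy_votes %>%
  filter(vote <= 3)

toy_kept$vote
```

Writing the condition as vote %in% c(1, 2, 3) would select the same rows and spells out the codes being kept.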

Adding a year column

The next step of data cleaning is manipulating your variables (columns) to make them more informative.

In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.

# Add another %>% step to add a year column
votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945)

Adding a country column

The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since you’d like to work with recognizable country names.

You can use the countrycode package to translate. For example:

library(countrycode)

# Translate the country code 2
countrycode(2, "cown", "country.name")
## [1] "United States"
# Translate the country code 703
countrycode(703, "cown", "country.name")
## [1] "Kyrgyzstan"
# Translate multiple country codes
countrycode(c(2, 20, 40), "cown", "country.name")
## [1] "United States" "Canada"        "Cuba"
# Convert country code 100
countrycode(100, "cown", "country.name")

# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))
## Warning in countrycode(ccode, "cown", "country.name"): Some values were not matched unambiguously: 260
# Code 260 was not matched, so fill in its name by hand
votes_processed$country[votes_processed$ccode == 260] <- "German Federal Republic"
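
The manual fix for code 260 can also be written as a dplyr step, which keeps the whole cleaning process in one pipeline. The sketch below is one possible way to do it, using a toy data frame (invented for illustration) in which code 260 has come back as NA.

```r
library(dplyr)

# Toy data, invented for illustration: code 260 came back unmatched (NA)
toy_processed <- data.frame(
  ccode   = c(2, 260, 20),
  country = c("United States", NA, "Canada"),
  stringsAsFactors = FALSE
)

# Overwrite the country name wherever the code is 260; keep it otherwise
toy_patched <- toy_processed %>%
  mutate(country = if_else(ccode == 260, "German Federal Republic", country))

toy_patched
```

Either form gives the same result; the pipeline version simply avoids modifying votes_processed after it has been created.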

Video: Grouping and summarizing

Summarizing the full dataset

In this analysis, you’re going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.

You’ll start by finding this summary for the entire dataset: the fraction of all votes in their history that were “yes”. Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of “yes” votes.

head(votes_processed)
##   rcid session vote ccode year            country
## 1    3       1    1     2 1946      United States
## 2    3       1    3    20 1946             Canada
## 3    3       1    1    40 1946               Cuba
## 4    3       1    1    41 1946              Haiti
## 5    3       1    1    42 1946 Dominican Republic
## 6    3       1    1    70 1946             Mexico
votes_processed %>% 
  summarize(total = n(),
            percent_yes = mean(vote == 1))
##    total percent_yes
## 1 724922   0.7969878
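
The reason mean(vote == 1) gives the fraction of “yes” votes is that the comparison produces a logical vector, and R treats TRUE as 1 and FALSE as 0 when averaging. A small sketch with made-up vote codes (not from the dataset):

```r
# Toy vote codes, invented for illustration
vote <- c(1, 1, 3, 2, 1)

is_yes <- vote == 1        # TRUE TRUE FALSE FALSE TRUE
n_yes <- sum(is_yes)       # number of TRUE values
frac_yes <- mean(is_yes)   # share of TRUE values

n_yes
frac_yes
```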

Summarizing by year

The summarize() function is especially useful because it can be used within groups.

For example, you might like to know how much the average “agreeableness” of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.

by_year <- votes_processed %>%
  group_by(year) %>% 
  summarize(total = n(),
            percent_yes = mean(vote == 1))

The group_by() function must go before your call to summarize() when you’re trying to perform your summary within groups.
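
To see the effect of grouping, the sketch below runs the same summarize() call with and without group_by() on a toy data frame (invented for illustration). Without grouping you get a single row; with grouping you get one row per year.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  year = c(1946, 1946, 1947, 1947, 1947),
  vote = c(1, 3, 1, 1, 2)
)

# One row for the whole dataset
overall <- toy %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# One row per year
toy_by_year <- toy %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

overall
toy_by_year
```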

Summarizing by country

In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.

# Summarize by country: by_country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

Video: Sorting and filtering summarized data

Transforming tidy data. Source: DataCamp

Sorting by percentage of “yes” votes

Now that you’ve summarized the dataset by country, you can start examining it and answering interesting questions.

For example, you might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.

# Print first few entries of the by_country dataset
head(by_country)
## # A tibble: 6 x 3
##   country           total percent_yes
##   <chr>             <int>       <dbl>
## 1 Afghanistan        4899       0.840
## 2 Albania            3439       0.719
## 3 Algeria            4450       0.898
## 4 Andorra            1487       0.649
## 5 Angola             3025       0.923
## 6 Antigua & Barbuda  2595       0.918
# Sort in ascending order of percent_yes
by_country %>%
  arrange(percent_yes)
## # A tibble: 200 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 Zanzibar                             2       0    
##  2 United States                     5312       0.285
##  3 Palau                              841       0.313
##  4 Israel                            4866       0.349
##  5 German Federal Republic           2151       0.398
##  6 Micronesia (Federated States of)  1405       0.414
##  7 United Kingdom                    5294       0.428
##  8 France                            5247       0.433
##  9 Marshall Islands                  1534       0.485
## 10 Belgium                           5313       0.495
## # ... with 190 more rows
# Now sort in descending order
by_country %>%
  arrange(desc(percent_yes))
## # A tibble: 200 x 3
##    country              total percent_yes
##    <chr>                <int>       <dbl>
##  1 Seychelles            1771       0.978
##  2 Timor-Leste            769       0.969
##  3 São Tomé & Príncipe   2388       0.967
##  4 Djibouti              3269       0.956
##  5 Guinea-Bissau         3001       0.955
##  6 Cape Verde            3223       0.946
##  7 Comoros               2468       0.945
##  8 Mozambique            3382       0.943
##  9 United Arab Emirates  3954       0.942
## 10 Suriname              3354       0.941
## # ... with 190 more rows

Filtering summarized output

In the last exercise, you may have noticed that the country that voted “yes” least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can’t draw any substantial conclusions from that little data!

Typically in an analysis like this, when you find that a few of your observations have very little data while others have plenty, you set a threshold and filter out the observations that fall below it.

# Filter out countries with fewer than 100 votes
by_country %>%
  filter(total >= 100) %>%
  arrange(percent_yes)
## # A tibble: 199 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 United States                     5312       0.285
##  2 Palau                              841       0.313
##  3 Israel                            4866       0.349
##  4 German Federal Republic           2151       0.398
##  5 Micronesia (Federated States of)  1405       0.414
##  6 United Kingdom                    5294       0.428
##  7 France                            5247       0.433
##  8 Marshall Islands                  1534       0.485
##  9 Belgium                           5313       0.495
## 10 Luxembourg                        5245       0.513
## # ... with 189 more rows

2: Visualization with ggplot2

Once you’ve cleaned and summarized data, you’ll want to visualize them to understand trends and extract insights. Here you’ll use the ggplot2 package to explore trends in United Nations voting within each country over time.

Video: Visualization with ggplot2

Choosing an aesthetic

You’re going to create a line graph to show the trend over time of how many votes are “yes”.

Which aesthetic should you map the year variable to? The answer is the x-axis: to plot a line graph that shows a trend over time, the year variable goes on the x-axis and percent_yes on the y-axis.

# Create line plot
ggplot(by_year, aes(x = year, y = percent_yes)) +
  geom_line()

Other ggplot2 layers

A line plot is one way to display this data. You could also choose to display it as a scatter plot, with each year represented as a single point. This requires changing the layer (i.e. geom_line() to geom_point()).

You can also add additional layers to your graph, such as a smoothing curve with geom_smooth().

# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Video: Visualizing by country

by_year_country <- votes_processed %>%
  group_by(year, country) %>% 
  summarize(total = n(),
            percent_yes = mean(vote == 1))

us_france <- by_year_country %>% 
  filter(country %in% c("United States", "France"))
us_france
## # A tibble: 136 x 4
## # Groups:   year [68]
##     year country       total percent_yes
##    <dbl> <chr>         <int>       <dbl>
##  1  1946 France           43       0.558
##  2  1946 United States    43       0.605
##  3  1947 France           38       0.737
##  4  1947 United States    38       0.711
##  5  1948 France          104       0.452
##  6  1948 United States   104       0.452
##  7  1949 France           64       0.312
##  8  1949 United States    64       0.281
##  9  1950 France           53       0.321
## 10  1950 United States    53       0.491
## # ... with 126 more rows
ggplot(us_france, aes(x = year, y = percent_yes, color = country)) +
  geom_line() +
  ylim(0,1)

Summarizing by year and country

You’re more interested in trends of voting within specific countries than you are in the overall trend. So instead of summarizing just by year, summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.

by_year_country <- votes_processed %>%
  group_by(year, country) %>% 
  summarize(total = n(),
            percent_yes = mean(vote == 1))
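
One detail worth knowing: after grouping by two variables, summarize() removes only the last grouping level, so the result is still grouped by year (this is the “Groups: year” line in the printed output earlier). The sketch below, on toy data invented for illustration, shows how the .groups argument controls this. It assumes dplyr 1.0.0 or later, where .groups is available.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  year    = c(1946, 1946, 1946, 1947),
  country = c("France", "France", "Canada", "France"),
  vote    = c(1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Default behaviour made explicit: drop only the last grouping variable
still_grouped <- toy %>%
  group_by(year, country) %>%
  summarize(percent_yes = mean(vote == 1), .groups = "drop_last")

# Drop all grouping so later steps work on the whole table
ungrouped <- toy %>%
  group_by(year, country) %>%
  summarize(percent_yes = mean(vote == 1), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

The leftover grouping is harmless for the plots that follow, but dropping it can avoid surprises if you later call summarize() or mutate() on the result.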

Let’s make some plots using this new dataset in the next exercise.

Plotting just the UK over time

Now that you have the percentage of time that each country voted “yes” within each year, you can plot the trend for a particular country. In this case, you’ll look at the trend for just the United Kingdom.

This will involve using filter() on your data before giving it to ggplot2.

# Create a filtered version: UK_by_year
UK_by_year <- by_year_country %>%
  filter(country %in% c("United Kingdom"))

# Line plot of percent_yes over time for UK only
ggplot(UK_by_year, aes(x = year, y = percent_yes)) +
  geom_line()

# Create a filtered version for Kyrgyzstan: Kg_by_year
Kg_by_year <- by_year_country %>%
  filter(country %in% c("Kyrgyzstan"))

# Line plot of percent_yes over time for Kyrgyzstan only
ggplot(Kg_by_year, aes(x = year, y = percent_yes)) +
  geom_line() +
  ylim(0, 1)

Plotting multiple countries

Plotting just one country at a time is interesting, but you really want to compare trends between countries. For example, suppose you want to compare voting trends for the United States, the UK, France, and India.

You’ll have to filter to include all four of these countries and use another aesthetic (not just x- and y-axes) to distinguish the countries on the resulting visualization. Here, you’ll use the color aesthetic to represent different countries.

# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
               "France", "India")

# Filter by_year_country: filtered_4_countries
filtered_4_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(x = year, y = percent_yes, color = country)) +
  geom_line()
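
When filtering for several values at once, %in% is the operator to reach for. Using == against a vector recycles the vector along the column and compares element by element, which silently drops rows. The sketch below illustrates the difference on toy data (invented for illustration).

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  country     = c("France", "India", "France", "India"),
  percent_yes = c(0.5, 0.8, 0.6, 0.9),
  stringsAsFactors = FALSE
)
wanted <- c("India", "France")

# %in% asks "is this country one of the wanted ones?" for every row
with_in <- toy %>%
  filter(country %in% wanted)

# == recycles `wanted` and compares position by position:
# "France" vs "India", "India" vs "France", and so on
with_eq <- toy %>%
  filter(country == wanted)

nrow(with_in)
nrow(with_eq)
```

In this toy case the == version matches no rows at all, because every position happens to line up with the other country.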

Video: Faceting by country

Faceting the time series

Now you’ll take a look at six countries. While in the previous exercise you used color to represent distinct countries, this gets a little too crowded with six.

Instead, you will facet, giving each country its own sub-plot. To do so, you add a facet_wrap() step after all of your layers.

# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India")

# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(x = year, y = percent_yes)) +
  geom_line() +
  facet_wrap(~country)

Faceting with free y-axis

In the previous plot, all six graphs had the same axis limits. This made the changes over time hard to examine for plots with relatively little change.

Instead, you may want to let the plot choose a different y-axis for each facet.

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales="free_y")
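
To compare the two settings side by side, here is a small sketch on toy data (invented for illustration, with hypothetical countries "A" and "B") where one series varies much less than the other. With shared scales the flatter series looks almost constant; with scales = "free_y" each panel gets its own y-axis range.

```r
library(ggplot2)

# Toy data, invented for illustration: "A" barely moves, "B" moves a lot
toy <- data.frame(
  year        = rep(c(1946, 1947, 1948), times = 2),
  country     = rep(c("A", "B"), each = 3),
  percent_yes = c(0.10, 0.12, 0.11, 0.70, 0.90, 0.80)
)

# Shared y-axis across facets (the default)
p_fixed <- ggplot(toy, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country)

# Independent y-axis per facet
p_free <- ggplot(toy, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")

p_fixed
p_free
```

The trade-off is that free scales make within-country changes easier to see but make it harder to compare levels between countries, since the same height no longer means the same value.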

Choose your own countries

The purpose of an exploratory data analysis is to ask questions and answer them with data. Now it’s your turn to ask the questions.

You’ll choose some countries whose history you are interested in and add them to the graph. If you want to look up the full list of countries, enter by_country$country in the console.

# Add three more countries to this list
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India", "Kyrgyzstan", 
               "Georgia", "Germany")

# Filtered by_year_country: filtered_countries
filtered_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")


3: Tidy modeling with broom

While visualization helps you understand one country at a time, statistical modeling lets you quantify trends across many countries and interpret them together. Here you’ll learn to use the tidyr, purrr, and broom packages to fit linear models to each country, and understand and compare their outputs.

Video: Linear regression
