The United Nations Voting Dataset

Filtering rows

The vote column in the dataset has a number that represents that country’s vote:

1 = Yes
2 = Abstain
3 = No
8 = Not present
9 = Not a member

One step of data cleaning is removing observations (rows) that you’re not interested in. In this case, you want to remove “Not present” and “Not a member”.

# Load the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
votes <- readRDS("_data/votes.rds")

# Print the votes dataset
votes
## # A tibble: 508,929 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     9    31
##  4    46       2     1    40
##  5    46       2     1    41
##  6    46       2     1    42
##  7    46       2     9    51
##  8    46       2     9    52
##  9    46       2     9    53
## 10    46       2     9    54
## # ... with 508,919 more rows
# Filter for votes that are "yes", "abstain", or "no"
votes %>%
  filter(vote <= 3)
## # A tibble: 353,547 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     1    40
##  4    46       2     1    41
##  5    46       2     1    42
##  6    46       2     1    70
##  7    46       2     1    90
##  8    46       2     1    91
##  9    46       2     1    92
## 10    46       2     1    93
## # ... with 353,537 more rows
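
The comparison vote <= 3 works because the three codes you want to keep are the smallest ones. As a minimal sketch on made-up data (toy_votes is a hypothetical tibble, not part of the dataset), the same filter can be written more explicitly with %in%:

```r
library(dplyr)

# Hypothetical toy data using the same kind of vote codes as the dataset
toy_votes <- tibble(
  rcid = c(46, 46, 46, 46, 46),
  vote = c(1, 2, 3, 8, 9)
)

# Keep only the codes 1, 2 and 3, named explicitly
kept_in <- toy_votes %>%
  filter(vote %in% c(1, 2, 3))

# The numeric comparison used above keeps the same rows
kept_le <- toy_votes %>%
  filter(vote <= 3)

kept_in
```

The %in% form is a little longer, but it keeps working if a new code below 3 is ever introduced that you do not want.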

Adding a year column

The next step of data cleaning is manipulating your variables (columns) to make them more informative.

In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.

# Add another %>% step to add a year column
votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945)
## # A tibble: 353,547 x 5
##     rcid session  vote ccode  year
##    <dbl>   <dbl> <dbl> <int> <dbl>
##  1    46       2     1     2  1947
##  2    46       2     1    20  1947
##  3    46       2     1    40  1947
##  4    46       2     1    41  1947
##  5    46       2     1    42  1947
##  6    46       2     1    70  1947
##  7    46       2     1    90  1947
##  8    46       2     1    91  1947
##  9    46       2     1    92  1947
## 10    46       2     1    93  1947
## # ... with 353,537 more rows
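
To check the arithmetic, here is a small sketch on a hypothetical toy_sessions tibble: session 1 should map to 1946, the first year of voting, and session 2 to 1947, matching the output above.

```r
library(dplyr)

# Hypothetical session numbers; session 1 corresponds to 1946
toy_sessions <- tibble(session = c(1, 2, 34)) %>%
  mutate(year = session + 1945)

toy_sessions$year
```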

Adding a country column

The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since you’d like to work with recognizable country names.

You can use the countrycode package to translate. For example:

library(countrycode)

# Translate the country code 2
countrycode(2, "cown", "country.name")
## [1] "United States"

# Translate multiple country codes
countrycode(c(2, 20, 40), "cown", "country.name")
## [1] "United States" "Canada"        "Cuba"

# Load the countrycode package
library(countrycode)

# Convert country code 100
countrycode(100, "cown", "country.name")
## [1] "Colombia"
# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))
## Warning: Problem with `mutate()` input `country`.
## i Some values were not matched unambiguously: 260
## 
## i Input `country` is `countrycode(ccode, "cown", "country.name")`.
## Warning in countrycode(ccode, "cown", "country.name"): Some values were not matched unambiguously: 260
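
The warning means that code 260 could not be translated unambiguously, so countrycode() returns NA for those rows; they appear as <NA> in the country summaries later on. The sketch below uses a hypothetical toy_processed tibble standing in for votes_processed to show one way you might list the unmatched codes and patch them by hand. The label "German Federal Republic" is an assumption taken from the Correlates of War state list, so verify it before relying on it.

```r
library(dplyr)

# Hypothetical stand-in for votes_processed after the countrycode() step
toy_processed <- tibble(
  ccode   = c(2L, 20L, 260L, 260L),
  country = c("United States", "Canada", NA, NA)
)

# Which codes failed to translate?
unmatched <- toy_processed %>%
  filter(is.na(country)) %>%
  distinct(ccode)

# Patch the missing name by hand (the label is an assumption, see above)
toy_fixed <- toy_processed %>%
  mutate(country = if_else(ccode == 260L, "German Federal Republic", country))

toy_fixed
```

countrycode() also has a custom_match argument that can supply this kind of override during the translation itself, which may be tidier than patching afterwards.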

Grouping and summarizing

Summarizing the full dataset

In this analysis, you’re going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.

You’ll start by finding this summary for the entire dataset: the fraction of all votes cast that were “yes”. Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of “yes” votes.

# Print votes_processed
votes_processed
## # A tibble: 353,547 x 6
##     rcid session  vote ccode  year country           
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
##  1    46       2     1     2  1947 United States     
##  2    46       2     1    20  1947 Canada            
##  3    46       2     1    40  1947 Cuba              
##  4    46       2     1    41  1947 Haiti             
##  5    46       2     1    42  1947 Dominican Republic
##  6    46       2     1    70  1947 Mexico            
##  7    46       2     1    90  1947 Guatemala         
##  8    46       2     1    91  1947 Honduras          
##  9    46       2     1    92  1947 El Salvador       
## 10    46       2     1    93  1947 Nicaragua         
## # ... with 353,537 more rows
# Find total and fraction of "yes" votes
votes_processed %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## # A tibble: 1 x 2
##    total percent_yes
##    <int>       <dbl>
## 1 353547       0.800
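
The expression mean(vote == 1) works because vote == 1 produces a logical vector, and R treats TRUE as 1 and FALSE as 0 when averaging. A minimal sketch with made-up vote codes:

```r
# Hypothetical vote codes: three "yes" (1) and two other votes
vote <- c(1, 1, 2, 3, 1)

is_yes <- vote == 1   # logical vector: TRUE where the vote is "yes"
sum(is_yes)           # number of "yes" votes
mean(is_yes)          # fraction of "yes" votes (3 out of 5)
```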

Summarizing by year

The summarize() function is especially useful because it can be used within groups.

For example, you might like to know how much the average “agreeableness” of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.

# Change this code to summarize by year
votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 34 x 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1947  2039       0.569
##  2  1949  3469       0.438
##  3  1951  1434       0.585
##  4  1953  1537       0.632
##  5  1955  2169       0.695
##  6  1957  2708       0.609
##  7  1959  4326       0.588
##  8  1961  7482       0.573
##  9  1963  3308       0.729
## 10  1965  4382       0.708
## # ... with 24 more rows

Note that group_by() must come before your call to summarize() when you want the summary computed within groups.
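
To see what grouping does, here is a sketch on a hypothetical five-row tibble. It also passes .groups = "drop", the argument mentioned in the message above, which returns an ungrouped result and silences that message.

```r
library(dplyr)

# Hypothetical toy data: two votes in 1947, three in 1949
toy <- tibble(
  year = c(1947, 1947, 1949, 1949, 1949),
  vote = c(1, 3, 1, 1, 2)
)

# One output row per year, each summarized separately
toy_by_year <- toy %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1),
            .groups = "drop")

toy_by_year
```

Each year gets its own total and its own fraction of “yes” votes: 1 of 2 for 1947 and 2 of 3 for 1949.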

Summarizing by country

In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.

# Summarize by country: by_country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)

Sorting and filtering summarized data

Sorting by percentage of “yes” votes

Now that you’ve summarized the dataset by country, you can start examining it and answering interesting questions.

For example, you might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.

# You have the votes summarized by country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# Print the by_country dataset
by_country
## # A tibble: 200 x 3
##    country           total percent_yes
##    <chr>             <int>       <dbl>
##  1 Afghanistan        2373       0.859
##  2 Albania            1695       0.717
##  3 Algeria            2213       0.899
##  4 Andorra             719       0.638
##  5 Angola             1431       0.924
##  6 Antigua & Barbuda  1302       0.912
##  7 Argentina          2553       0.768
##  8 Armenia             758       0.747
##  9 Australia          2575       0.557
## 10 Austria            2389       0.622
## # ... with 190 more rows
# Sort in ascending order of percent_yes
by_country %>%
  arrange(percent_yes)
## # A tibble: 200 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 Zanzibar                             2       0    
##  2 United States                     2568       0.269
##  3 Palau                              369       0.339
##  4 Israel                            2380       0.341
##  5 <NA>                              1075       0.397
##  6 United Kingdom                    2558       0.417
##  7 France                            2527       0.427
##  8 Micronesia (Federated States of)   724       0.442
##  9 Marshall Islands                   757       0.491
## 10 Belgium                           2568       0.492
## # ... with 190 more rows
# Now sort in descending order
by_country %>%
  arrange(desc(percent_yes))
## # A tibble: 200 x 3
##    country              total percent_yes
##    <chr>                <int>       <dbl>
##  1 São Tomé & Príncipe   1091       0.976
##  2 Seychelles             881       0.975
##  3 Djibouti              1598       0.961
##  4 Guinea-Bissau         1538       0.960
##  5 Timor-Leste            326       0.957
##  6 Mauritius             1831       0.950
##  7 Zimbabwe              1361       0.949
##  8 Comoros               1133       0.947
##  9 United Arab Emirates  1934       0.947
## 10 Mozambique            1701       0.947
## # ... with 190 more rows

Filtering summarized output

In the last exercise, you may have noticed that the country that voted “yes” least often, Zanzibar, had only 2 votes in the entire dataset. You certainly can’t draw any meaningful conclusions from that little data!

Typically, as an analysis progresses and you find that a few of your observations have very little data while others have plenty, you set some threshold to filter them out.

# Filter out countries with fewer than 100 votes
by_country %>%
  arrange(percent_yes) %>%
  filter(total >= 100)
## # A tibble: 197 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 United States                     2568       0.269
##  2 Palau                              369       0.339
##  3 Israel                            2380       0.341
##  4 <NA>                              1075       0.397
##  5 United Kingdom                    2558       0.417
##  6 France                            2527       0.427
##  7 Micronesia (Federated States of)   724       0.442
##  8 Marshall Islands                   757       0.491
##  9 Belgium                           2568       0.492
## 10 Canada                            2576       0.508
## # ... with 187 more rows
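
The order of filter() and arrange() does not change which rows survive, but filtering first means fewer rows are sorted. A sketch on a hypothetical toy_by_country tibble, with the threshold held in a variable (min_votes is an illustrative name) so it is easy to adjust:

```r
library(dplyr)

# Hypothetical country summaries; "A" has too few votes to be meaningful
toy_by_country <- tibble(
  country = c("A", "B", "C"),
  total = c(2L, 500L, 1500L),
  percent_yes = c(0, 0.9, 0.3)
)

min_votes <- 100  # illustrative threshold; choose one that suits your data

toy_filtered <- toy_by_country %>%
  filter(total >= min_votes) %>%
  arrange(percent_yes)

toy_filtered
```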

Visualization with ggplot2

Plotting a line over time

In the last chapter, you learned how to summarize() the votes dataset by year, particularly the percentage of votes in each year that were “yes”.

You’ll now use the ggplot2 package to turn your results into a visualization of the percentage of “yes” votes over time.

# Define by_year
by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# Load the ggplot2 package
library(ggplot2)

# Create line plot
ggplot(by_year, aes(year, percent_yes)) +
  geom_line()
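
The plot above relies on default axis labels. As a sketch, the same line plot can name its aesthetics explicitly and add readable labels with labs(). The toy_by_year data frame here is a hypothetical stand-in that reuses the first four rows of the by-year output shown earlier.

```r
library(ggplot2)

# Stand-in data: first four rows of the by-year summary printed earlier
toy_by_year <- data.frame(
  year = c(1947, 1949, 1951, 1953),
  percent_yes = c(0.569, 0.438, 0.585, 0.632)
)

# Same line plot, with explicit x/y mappings and axis labels
p <- ggplot(toy_by_year, aes(x = year, y = percent_yes)) +
  geom_line() +
  labs(x = "Year", y = "Fraction of \"yes\" votes")

p
```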

Other ggplot2 layers

A line plot is one way to display this data. You could also choose to display it as a scatter plot, with each year represented as a single point. This requires changing the layer (i.e. geom_line() to geom_point()).

You can also add further layers to your graph, such as a smoothing curve with geom_smooth().

# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
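
The message above reports the defaults that geom_smooth() picked. A sketch of stating them explicitly, which documents the choice and suppresses the message. The toy_by_year data frame is a hypothetical stand-in built from the first ten rows of the by-year output shown earlier, and se = FALSE (hiding the confidence band) is an optional choice made here for illustration.

```r
library(ggplot2)

# Stand-in data: first ten rows of the by-year summary printed earlier
toy_by_year <- data.frame(
  year = seq(1947, 1965, by = 2),
  percent_yes = c(0.569, 0.438, 0.585, 0.632, 0.695,
                  0.609, 0.588, 0.573, 0.729, 0.708)
)

# Scatter plot plus a loess curve, with method and formula spelled out
p_smooth <- ggplot(toy_by_year, aes(x = year, y = percent_yes)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_smooth
```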

Visualizing by country