Rohit Chaubey
November 24, 2017
Description:
We’ll explore the historical voting of the United Nations General Assembly, including analyzing differences in voting between countries, across time, and among international issues.
In the process we’ll gain more practice with the dplyr and ggplot2 packages, learn about the broom package for tidying model output, and experience the kind of start-to-finish exploratory analysis common in data science.
Steps involved while analyzing the data sets are as follows:
Data cleaning and summarizing with dplyr
Data visualization with ggplot2
Tidy modeling with broom
Joining and tidying
DataSets used are as follows:
votes.rds <- Each country’s vote on each roll call (rcid), by session.
descriptions.rds <- Topics of each roll call, joined to the votes by rcid and session.
COW country codes.csv <- Translation of country code to country name.
Packages used are as follows:
dplyr <- Pipe operator, filter, mutate, select operation on data set.
ggplot2 <- Data visualization using line charts, scatter plots, and smoothed trend lines.
broom <- Tidying linear regression models.
tidyr <- nest and unnest a data frame.
purrr <- map function to apply formula to each element in a data set.
Load the packages
library(dplyr)
library(ggplot2)
library(broom)
library(tidyr)
library(purrr)
Read the Votes file and Join it with COW Country Code to extract corresponding Country names.
votes <- readRDS("votes.rds")
votes$year = votes$session + 1945
countrycode <- read.csv("COW country codes.csv")
votes_processed <- inner_join(votes, countrycode, by = c("ccode" = "CCode"))
colnames(votes_processed)[colnames(votes_processed) == 'StateNme'] <- 'Country'
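The join above matches two differently named key columns, and an inner join silently drops rows that have no match. A minimal sketch on toy data (the values and the toy_* names are illustrative, not taken from votes.rds) shows both behaviours, along with rename() as a dplyr alternative to the colnames() assignment:

```r
library(dplyr)

# Toy stand-ins for the real tables (illustrative values only)
toy_votes <- data.frame(ccode = c(2, 200, 999), vote = c(1, 3, 1))
toy_codes <- data.frame(CCode = c(2, 200),
                        StateNme = c("United States of America", "United Kingdom"))

# Match ccode (left table) to CCode (right table); code 999 has no match and is dropped
toy_joined <- inner_join(toy_votes, toy_codes, by = c("ccode" = "CCode")) %>%
  rename(Country = StateNme)

toy_joined

# Rows of the left table that found no partner in the right table
toy_unmatched <- anti_join(toy_votes, toy_codes, by = c("ccode" = "CCode"))
toy_unmatched
```

Checking the anti_join() result before relying on the joined table is a cheap way to see how many votes would be lost to missing country codes.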
Summarize total number of votes by Year
votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## # A tibble: 34 × 3
## year total percent_yes
## <dbl> <int> <dbl>
## 1 1947 8436 0.1787577
## 2 1949 14208 0.1392173
## 3 1951 5550 0.1936937
## 4 1953 5772 0.2118850
## 5 1955 8214 0.2307037
## 6 1957 7548 0.2759671
## 7 1959 11988 0.2665165
## 8 1961 16872 0.3089142
## 9 1963 7104 0.4034347
## 10 1965 9102 0.4030982
## # ... with 24 more rows
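The percent_yes column relies on the fact that the mean of a logical vector is the share of TRUE values. A small sketch on made-up votes (assuming, as the code above implies, that 1 codes a “yes”; the other codes here are placeholders) makes the arithmetic visible:

```r
library(dplyr)

# Made-up votes: 1 is treated as "yes", any other value as "not yes"
toy <- data.frame(year = c(1947, 1947, 1947, 1949, 1949),
                  vote = c(1, 3, 1, 2, 1))

toy_by_year <- toy %>%
  group_by(year) %>%
  summarize(total = n(),                    # number of rows in the group
            percent_yes = mean(vote == 1))  # share of rows where vote == 1

toy_by_year
```

Note that percent_yes is a proportion between 0 and 1 rather than a percentage; here it is 2/3 for 1947 and 1/2 for 1949.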
Summarize total number of votes by Country
by_country <- votes_processed %>%
group_by(Country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
Now that we’ve summarized the dataset by country, we can start examining it and answering interesting questions.
For example, one might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.
by_country %>%
arrange(percent_yes)
by_country %>%
arrange(desc(percent_yes))
Filtering summarized output
We noticed that the country that voted “yes” least frequently, Zanzibar, had only 2 votes in the entire dataset. We certainly can’t draw any substantial conclusions from so little data!
Typically in a progressive analysis, when we find that a few of our observations have very little data while others have plenty, we will set some threshold to filter them out.
Filter out countries with fewer than 100 votes
by_country %>%
arrange(percent_yes) %>%
filter(total >= 100)
We will now use the ggplot2 package to turn these results into a visualization of the percentage of “yes” votes over time.
by_year <- votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
ggplot(by_year, aes(year, percent_yes)) +
geom_line() +
ggtitle("Line Chart")
ggplot(by_year, aes(year, percent_yes)) +
geom_point() +
geom_smooth() +
ggtitle("Scatter Plot")


Summarizing by year and country
We are more interested in trends of voting within specific countries than in the overall trend.
So instead of summarizing just by year, we will summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.
by_year_country <- votes_processed %>%
group_by(Country, year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
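One detail worth knowing when grouping by two variables: summarize() removes only the last grouping level, so the result is still grouped by Country. A minimal sketch on toy data (hypothetical countries "A" and "B") shows how to inspect and remove the leftover grouping:

```r
library(dplyr)

# Toy data with two hypothetical countries
toy <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  year    = c(1947, 1947, 1949, 1947, 1947),
                  vote    = c(1, 3, 1, 1, 1))

toy_by_year_country <- toy %>%
  group_by(Country, year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# summarize() dropped the `year` grouping but kept `Country`
remaining_groups <- group_vars(toy_by_year_country)
remaining_groups

# ungroup() removes the leftover grouping
toy_ungrouped <- ungroup(toy_by_year_country)
group_vars(toy_ungrouped)
```

Leftover grouping is harmless for filter(), but it changes how later select(), mutate(), or nest() calls behave, so an explicit ungroup() is often the safer default.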
Plotting just the UK over time
Now that we have the percentage of time that each country voted “yes” within each year, we can plot the trend for a particular country.
In this case, we’ll look at the trend for just the United Kingdom.
UK_by_year <- by_year_country %>%
filter(Country == "United Kingdom")
ggplot(UK_by_year, aes(year, percent_yes)) +
geom_line() +
ggtitle("Percentage yes of UK over time")

Plotting multiple countries
Plotting just one country at a time is interesting, but we really want to compare trends between countries. For example, suppose we want to compare voting trends for China, the United Kingdom, France, Japan, Brazil, and India.
# Vector of six countries to examine
countries <- c("China", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
filter(Country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
ggtitle("Percentage of yes over time - Faceted by specific countries") +
facet_wrap(~ Country, scales = "free_y")

Linear regression on the United Kingdom
A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best-fit line. It is done with the lm() function in R.
Here, we’ll fit a linear regression to just the percentage of “yes” votes from the United Kingdom.
# Percentage of yes votes from the UK by year: UK_by_year
UK_by_year <- by_year_country %>%
filter(Country == "United Kingdom")
# Perform a linear regression of percent_yes by year: UK_fit
UK_fit <- lm(percent_yes ~ year, data = UK_by_year)
# Perform summary() on the UK_fit object
summary(UK_fit)
##
## Call:
## lm(formula = percent_yes ~ year, data = UK_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17371 -0.08172 -0.01991 0.08489 0.26935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5419969 1.9270751 -1.838 0.0754 .
## year 0.0020058 0.0009732 2.061 0.0475 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1113 on 32 degrees of freedom
## Multiple R-squared: 0.1172, Adjusted R-squared: 0.08959
## F-statistic: 4.248 on 1 and 32 DF, p-value: 0.04751
Tidying models with broom
Now we will use the tidy() function in the broom package to turn that model into a tidy data frame.
# Call the tidy() function on the UK_fit object
tidy(UK_fit)
## term estimate std.error statistic p.value
## 1 (Intercept) -3.541996852 1.9270750888 -1.838017 0.07535928
## 2 year 0.002005776 0.0009732225 2.060964 0.04751124
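Because tidy() returns an ordinary data frame, the slope can be pulled out with normal subsetting instead of being read off the printed summary. A minimal sketch on synthetic data (not UN votes; the trend of 0.002 per year and the small alternating wiggle are made up for illustration):

```r
library(broom)

# Synthetic series: a trend of 0.002 per year plus a small alternating wiggle
toy <- data.frame(year = seq(1947, 2013, by = 2))
toy$percent_yes <- 0.3 + 0.002 * (toy$year - 1947) +
  rep(c(-0.01, 0.01), length.out = nrow(toy))

toy_fit <- lm(percent_yes ~ year, data = toy)
toy_tidied <- tidy(toy_fit)

# One row per model term, with the statistics as columns
names(toy_tidied)

# Extract the slope on `year` by filtering on the term column
toy_slope <- toy_tidied$estimate[toy_tidied$term == "year"]
toy_slope
```

The recovered slope should land close to the 0.002 built into the synthetic data.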
Combining models for multiple countries
One important advantage of changing models to tidied data frames is that they can be combined.
We had fit a linear model to the percentage of “yes” votes for each year in the United Kingdom. Now we’ll fit the same model for India and combine the results from both countries.
# Fit model for India
IN_by_year <- by_year_country %>%
filter(Country == "India")
IN_fit <- lm(percent_yes ~ year, IN_by_year)
# Create UK_tidied and IN_tidied
UK_tidied <- tidy(UK_fit)
IN_tidied <- tidy(IN_fit)
# Combine the two tidied models
bind_rows(UK_tidied, IN_tidied)
## term estimate std.error statistic p.value
## 1 (Intercept) -3.541996852 1.9270750888 -1.838017 0.075359276
## 2 year 0.002005776 0.0009732225 2.060964 0.047511239
## 3 (Intercept) -5.294463740 1.9170812891 -2.761731 0.009444474
## 4 year 0.003065560 0.0009681753 3.166327 0.003381444
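In the combined output above, nothing records which rows belong to which country. One common remedy, sketched here on synthetic data for two hypothetical countries "A" and "B", is to add a label column to each tidied model before binding:

```r
library(dplyr)
library(broom)

# Synthetic yearly series for two hypothetical countries
years  <- seq(1947, 2013, by = 2)
wiggle <- rep(c(-0.01, 0.01), length.out = length(years))
a_by_year <- data.frame(year = years,
                        percent_yes = 0.3 + 0.002 * (years - 1947) + wiggle)
b_by_year <- data.frame(year = years,
                        percent_yes = 0.6 - 0.001 * (years - 1947) + wiggle)

# Tidy each model and tag its rows with the country they came from
a_tidied <- tidy(lm(percent_yes ~ year, data = a_by_year)) %>% mutate(Country = "A")
b_tidied <- tidy(lm(percent_yes ~ year, data = b_by_year)) %>% mutate(Country = "B")

combined <- bind_rows(a_tidied, b_tidied)
combined %>% select(Country, term, estimate)
```

With the label in place, the combined table can be filtered or grouped by Country like any other data frame.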
Nesting a Data Frame
Right now, the by_year_country data frame has one row per country-year pair. So that we can model each country individually, we’re going to “nest” all columns besides Country, which will result in a data frame with one row per country.
The data for each individual country will then be stored in a list column called data.
by_year_country <- by_year_country %>% select(2:4)
# Nest all columns besides country
nested <- nest(by_year_country, - Country)
Unnesting a Data Frame
The opposite of the nest() operation is the unnest() operation.
This takes each of the data frames in the list column and brings those rows back to the main data frame.
nested %>%
unnest(data)
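The next section works with a country_coefficients data frame whose construction is not shown. A plausible sketch, assuming it follows the same nest, map, tidy, unnest pattern used later for country_topic_coefficients, is given below on synthetic data; the toy_* names and the countries "A" and "B" are hypothetical, and the nest(data = ...) spelling assumes tidyr 1.0 or later:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Synthetic stand-in for by_year_country (two hypothetical countries)
years  <- seq(1947, 2013, by = 2)
wiggle <- rep(c(-0.01, 0.01), length.out = length(years))
toy_by_year_country <- bind_rows(
  data.frame(Country = "A", year = years,
             percent_yes = 0.3 + 0.002 * (years - 1947) + wiggle),
  data.frame(Country = "B", year = years,
             percent_yes = 0.6 - 0.001 * (years - 1947) + wiggle)
)

# One row per country; that country's rows live in the list column `data`
toy_nested <- toy_by_year_country %>%
  nest(data = c(year, percent_yes))

# Fit one model per country, tidy each model, then unnest the coefficients
toy_country_coefficients <- toy_nested %>%
  mutate(model  = map(data, ~ lm(percent_yes ~ year, data = .x)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

toy_country_coefficients %>% select(Country, term, estimate, p.value)
```

The result has two rows per country (intercept and slope), which is the shape the filtering step expects.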
Filtering for Slope and Significant Values
We currently have both the intercept and slope terms for each by-country model.
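We’re probably more interested in how each country’s voting is changing over time, so we want to focus on the slope terms.
Because we test one slope per country, some p-values will fall below .05 by chance alone. p.adjust(p.value) corrects a vector of p-values for multiple testing (using the Holm method by default), giving adjusted p-values that are safer to threshold.
Here we’ll add two steps to process the slope terms in country_coefficients: use mutate() to create the new, adjusted p-value column, and filter() to keep those below a .05 threshold.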
# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
# Sort for the countries increasing most quickly
filtered_countries %>% arrange(desc(estimate))
# Sort for the countries decreasing most quickly
filtered_countries %>% arrange(estimate)
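To see what the adjustment does, here is a small base R sketch with four made-up p-values, each below .05 on its own. p.adjust() uses the Holm method unless told otherwise; Bonferroni is shown for comparison:

```r
# Four hypothetical p-values, each "significant" when tested in isolation
p <- c(0.01, 0.02, 0.03, 0.04)

holm <- p.adjust(p)                          # default method is "holm"
bonf <- p.adjust(p, method = "bonferroni")   # multiplies every p-value by 4

holm
bonf

# After Holm adjustment only the smallest p-value stays below .05
holm < 0.05
```

Holm multiplies the smallest p-value by 4, the next by 3, and so on, then forces the results to be non-decreasing, which gives 0.04, 0.06, 0.06, 0.06 here. The more models are tested, the stronger the correction becomes.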
We have created the votes_processed dataset, containing information about each country’s votes. We’ll now combine that with the descriptions dataset, which includes topic information about each resolution, so that we can analyze votes within particular topics.
To do this, we’ll make use of the inner_join() function from dplyr.
descriptions <- readRDS("descriptions.rds")
# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>% inner_join(descriptions, by = c("rcid", "session"))
Filtering the joined dataset
There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution:
1. me: Palestinian conflict
2. nu: Nuclear weapons and nuclear material
3. di: Arms control and disarmament
4. hr: Human rights
5. co: Colonialism
6. ec: Economic development
Each contains a 1 if the resolution is related to this topic and a 0 otherwise.
Using gather to tidy a dataset
In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic.
This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).
# Perform gather and filter
votes_gathered <- votes_joined %>% gather(topic, has_topic, me:ec) %>% filter(has_topic == 1)
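A toy version of the gather step may help make the reshaping concrete. The sketch below uses only two of the six topic columns and made-up values; the toy_* names are illustrative:

```r
library(dplyr)
library(tidyr)

# Toy joined data: two roll calls, two topic columns (made-up values)
toy_joined <- data.frame(rcid    = c(1, 2),
                         Country = c("A", "A"),
                         vote    = c(1, 3),
                         me      = c(1, 0),
                         nu      = c(1, 1))

# Stack the topic columns into topic/has_topic pairs, then keep only real matches
toy_gathered <- toy_joined %>%
  gather(topic, has_topic, me:nu) %>%
  filter(has_topic == 1)

toy_gathered
```

Roll call 1 is tagged with both topics, so it appears twice in the result, while a roll call with no topic at all would disappear entirely. In tidyr 1.0 and later, pivot_longer(me:nu, names_to = "topic", values_to = "has_topic") is the newer way to do the same reshaping.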
Recoding the topics
There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:
1. me: Palestinian conflict
2. nu: Nuclear weapons and nuclear material
3. di: Arms control and disarmament
4. hr: Human rights
5. co: Colonialism
6. ec: Economic development
So that we can interpret the data more easily, recode the data to replace these codes with their full name.
# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
mutate(topic = recode(topic,
me = "Palestinian conflict",
nu = "Nuclear weapons and nuclear material",
di = "Arms control and disarmament",
hr = "Human rights",
co = "Colonialism",
ec = "Economic development"))
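recode() can also be tried on a bare character vector, which shows how it treats a code that has no replacement. A minimal sketch ("xx" is a made-up code used only to show the fallback behaviour):

```r
library(dplyr)

codes <- c("me", "nu", "xx")   # "xx" is a made-up code with no replacement

recoded <- recode(codes,
                  me = "Palestinian conflict",
                  nu = "Nuclear weapons and nuclear material")

recoded
```

For character input, values without a match are left unchanged, so a misspelled or unexpected code passes through silently rather than raising an error. A quick unique(votes_tidied$topic) after recoding is one way to catch that.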
Summarize by country, year, and topic
Now that you have topic as an additional variable, you can summarize the votes for each combination of country, year, and topic, e.g. for the United States in 2013 on the topic of nuclear weapons.
# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>%
group_by(Country, year, topic) %>%
summarize(total = n(), percent_yes = mean(vote == 1)) %>%
ungroup()
Visualizing trends in topics for one country
You can now visualize the trends in percentage “yes” over time for all six topics side-by-side.
Here, you’ll visualize them just for the United Kingdom.
# Filter by_country_year_topic for just the UK
UK_by_country_year_topic <- by_country_year_topic %>% filter(Country == "United Kingdom")
# Plot % yes over time for the UK, faceting by topic
UK_by_country_year_topic %>%
ggplot(aes(x= year, y = percent_yes)) +
geom_line() +
ggtitle("United Kingdom's voting pattern for different topics over the years") +
facet_wrap(~topic)

Linear models for each combination of country and topic
# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
nest(-Country, -topic) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
# Print country_topic_coefficients
country_topic_coefficients
## # A tibble: 2,400 × 7
## Country topic term estimate
## <fctr> <chr> <chr> <dbl>
## 1 Afghanistan Colonialism (Intercept) -2.781507722
## 2 Afghanistan Colonialism year 0.001826514
## 3 Afghanistan Economic development (Intercept) -10.965430258
## 4 Afghanistan Economic development year 0.005935676
## 5 Afghanistan Human rights (Intercept) -3.171885627
## 6 Afghanistan Human rights year 0.001985230
## 7 Afghanistan Palestinian conflict (Intercept) -7.552260810
## 8 Afghanistan Palestinian conflict year 0.004220446
## 9 Afghanistan Arms control and disarmament (Intercept) -10.355495881
## 10 Afghanistan Arms control and disarmament year 0.005625035
## # ... with 2,390 more rows, and 3 more variables: std.error <dbl>,
## # statistic <dbl>, p.value <dbl>
Filter only for Slope terms with significant P value
# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>% filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
Visualization of the models
We found that over its history, Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.
Let’s examine this country’s voting patterns more closely. Recall that the by_country_year_topic dataset contained one row for each combination of country, year, and topic.
We can use that to create a plot of Vanuatu’s voting, faceted by topic.
# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>% filter(Country == "Vanuatu")
# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(x = year, y = percent_yes)) +
geom_line() +
ggtitle("Vanuatu's voting patterns for different topics over the years") +
facet_wrap(~topic)
