United Nations Voting Data Analysis

Rohit Chaubey

rnc170030@utdallas.edu

November 24, 2017

Description:

We’ll explore the historical voting of the United Nations General Assembly, including analyzing differences in voting between countries, across time, and among international issues.

In the process we’ll gain more practice with the dplyr and ggplot2 packages, learn about the broom package for tidying model output, and experience the kind of start-to-finish exploratory analysis common in data science.

Steps involved while analyzing the data sets are as follows:

Data cleaning and summarizing with dplyr

Data visualization with ggplot2

Tidy modeling with broom

Joining and tidying

Datasets used are as follows:

votes.rds <- One row per country per roll-call vote (columns used here: rcid, session, vote, ccode).

descriptions.rds <- Topic flags for each roll-call vote, matched to votes on rcid and session.

COW country codes.csv <- Translation of country code to country name.

Packages used are as follows:

dplyr <- Pipe operator, filter, mutate, select operation on data set.

ggplot2 <- Data visualization using line charts, scatter plots with smoothed trends, and faceted plots.

broom <- Tidying linear regression models.

tidyr <- nest and unnest a data frame.

purrr <- map function to apply formula to each element in a data set.

Load the packages

library(dplyr)
library(ggplot2)
library(broom)
library(tidyr)
library(purrr)

Read the votes file, derive the year from the session number (session + 1945), and join with the COW country codes to attach the corresponding country names.

votes <- readRDS("votes.rds")
votes$year = votes$session + 1945
countrycode <- read.csv("COW country codes.csv")
votes_processed <- inner_join(votes, countrycode, by = c("ccode" = "CCode"))
colnames(votes_processed)[colnames(votes_processed) == 'StateNme'] <- 'Country'
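The join above matches the ccode column of the votes to the CCode column of the lookup table. The sketch below shows the same pattern on invented rows (toy_votes and toy_codes are hypothetical stand-ins, not the real files), including what happens to a code that has no match.

```r
library(dplyr)

# Toy stand-ins for the real files; all values are invented for illustration
toy_votes <- data.frame(rcid  = c(1, 1, 2),
                        ccode = c(2, 200, 999),
                        vote  = c(1, 3, 1))
toy_codes <- data.frame(CCode    = c(2, 200),
                        StateNme = c("United States of America", "United Kingdom"),
                        stringsAsFactors = FALSE)

# Join on differently named key columns, then rename the name column
toy_joined <- toy_votes %>%
  inner_join(toy_codes, by = c("ccode" = "CCode")) %>%
  rename(Country = StateNme)

# The row with ccode 999 has no match in toy_codes, so inner_join() drops it
toy_joined
```

An inner join can change the row count in both directions: unmatched codes disappear, and a lookup table with repeated codes would repeat the matching vote rows. Comparing nrow() before and after the join is a cheap check.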

Summarize total number of votes by Year

votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## # A tibble: 34 × 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
## 1   1947  8436   0.1787577
## 2   1949 14208   0.1392173
## 3   1951  5550   0.1936937
## 4   1953  5772   0.2118850
## 5   1955  8214   0.2307037
## 6   1957  7548   0.2759671
## 7   1959 11988   0.2665165
## 8   1961 16872   0.3089142
## 9   1963  7104   0.4034347
## 10  1965  9102   0.4030982
## # ... with 24 more rows
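The percent_yes column relies on the fact that the mean of a logical vector is the fraction of TRUE values. A minimal sketch on invented votes (the toy data frame is hypothetical; as in the code above, a vote value of 1 is treated as "yes"):

```r
library(dplyr)

# Five invented votes across two years
toy <- data.frame(year = c(1947, 1947, 1947, 1949, 1949),
                  vote = c(1, 1, 3, 1, 2))

toy_by_year <- toy %>%
  group_by(year) %>%
  summarize(total = n(),                     # number of rows in the group
            percent_yes = mean(vote == 1))   # share of rows where vote is 1

# 1947 has 2 "yes" out of 3 votes, 1949 has 1 out of 2
toy_by_year
```

Note that every row in a group counts toward the denominator, so any vote code other than 1 lowers percent_yes.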

Summarize total number of votes by Country

by_country <- votes_processed %>%
  group_by(Country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

Now that we’ve summarized the dataset by country, we can start examining it and answering interesting questions.

For example, one might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.

by_country %>%
  arrange(percent_yes)

by_country %>%
  arrange(desc(percent_yes))

Filtering summarized output

We noticed that the country that voted “yes” least often, Zanzibar, had only 2 votes in the entire dataset. We certainly can’t make any substantial conclusions based on that data!

Typically in a progressive analysis, when we find that a few of our observations have very little data while others have plenty, we will set some threshold to filter them out.

Filter out countries with fewer than 100 votes

by_country %>%
  arrange(percent_yes) %>%
  filter(total >= 100)

We will now use the ggplot2 package to turn the by-year summary into a visualization of the percentage of “yes” votes over time.

by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

ggplot(by_year, aes(year, percent_yes)) +
  geom_line() + 
  ggtitle("Line Chart")

ggplot(by_year, aes(year, percent_yes)) +
  geom_point() +
  geom_smooth() + 
  ggtitle("Scatter Plot")

Summarizing by year and country

We are more interested in trends of voting within specific countries than in the overall trend.

So instead of summarizing just by year, we will summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.

by_year_country <- votes_processed %>%
  group_by(Country, year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
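One detail worth knowing when grouping by two variables: summarize() removes only the last grouping level, so the result is still grouped by Country. A small sketch on invented rows (toy is hypothetical) makes this visible:

```r
library(dplyr)

# Invented votes for two countries across two years
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1947, 1947, 1949, 1947, 1949, 1949),
  vote    = c(1, 3, 1, 1, 1, 2)
)

toy_by_year_country <- toy %>%
  group_by(Country, year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# Still grouped by Country after summarize()
group_vars(toy_by_year_country)

# ungroup() removes the remaining grouping
group_vars(ungroup(toy_by_year_country))
```

The leftover grouping is harmless for filtering and plotting, but it can matter for later steps such as selecting columns or nesting, so calling ungroup() once the summary is built is a reasonable habit.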

Plotting just the UK over time

Now that we have the percentage of time that each country voted “yes” within each year, we can plot the trend for a particular country.

In this case, we’ll look at the trend for just the United Kingdom.

UK_by_year <- by_year_country %>%
  filter(Country == "United Kingdom")
ggplot(UK_by_year, aes(year, percent_yes)) +
  geom_line() +
  ggtitle("Percentage yes of UK over time")

Plotting multiple countries

Plotting just one country at a time is interesting, but we really want to compare trends between countries. For example, suppose we want to compare voting trends for China, the United Kingdom, France, Japan, Brazil, and India.

# Vector of six countries to examine
countries <- c("China", "United Kingdom",
               "France", "Japan", "Brazil", "India")

# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
  filter(Country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  ggtitle("Percentage of yes over time - Faceted by specific countries") +
  facet_wrap(~ Country, scales = "free_y")

Linear regression on the United Kingdom

A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best-fit line. In R it is done with the lm() function.

Here, we’ll fit a linear regression to just the percentage of “yes” votes from the United Kingdom.

# Percentage of yes votes from the UK by year: UK_by_year
UK_by_year <- by_year_country %>%
  filter(Country == "United Kingdom")

# Perform a linear regression of percent_yes by year: UK_fit
UK_fit <- lm(percent_yes ~ year, data = UK_by_year)

# Perform summary() on the UK_fit object
summary(UK_fit)

## 
## Call:
## lm(formula = percent_yes ~ year, data = UK_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17371 -0.08172 -0.01991  0.08489  0.26935 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.5419969  1.9270751  -1.838   0.0754 .
## year         0.0020058  0.0009732   2.061   0.0475 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1113 on 32 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.08959 
## F-statistic: 4.248 on 1 and 32 DF,  p-value: 0.04751
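Reading the output above: the year coefficient of about 0.0020 means the fitted share of “yes” votes rises by roughly 0.2 percentage points per year, and the R-squared of 0.1172 says the straight line accounts for only about 12% of the year-to-year variation. The sketch below, on an invented series with a known trend (toy_uk is hypothetical), shows how to pull the slope out of a fit directly with coef().

```r
# Invented series that rises by exactly 0.01 per year, so the true slope is known
toy_uk <- data.frame(year = c(1947, 1949, 1951, 1953))
toy_uk$percent_yes <- 0.20 + 0.01 * (toy_uk$year - 1947)

toy_fit <- lm(percent_yes ~ year, data = toy_uk)

# coef() returns a named vector; the "year" entry is the slope per year
slope <- unname(coef(toy_fit)["year"])
slope
```

The intercept is the fitted value at year 0, far outside the data, which is why it is a large negative number in the real output and not something to interpret on its own.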

Tidying models with broom

Now we will use the tidy() function in the broom package to turn that model into a tidy data frame with one row per coefficient.

# Call the tidy() function on the UK_fit object
tidy(UK_fit)

##          term     estimate    std.error statistic    p.value
## 1 (Intercept) -3.541996852 1.9270750888 -1.838017 0.07535928
## 2        year  0.002005776 0.0009732225  2.060964 0.04751124

Combining models for multiple countries

One important advantage of changing models to tidied data frames is that they can be combined.

We have fit a linear model to the percentage of “yes” votes for each year in the United Kingdom. Now we’ll fit the same model for India and combine the results from both countries.

# Fit model for India
IN_by_year <- by_year_country %>%
  filter(Country == "India")
IN_fit <- lm(percent_yes ~ year, IN_by_year)

# Create UK_tidied and IN_tidied
UK_tidied <- tidy(UK_fit)
IN_tidied <- tidy(IN_fit)

# Combine the two tidied models
bind_rows(UK_tidied, IN_tidied)

##          term     estimate    std.error statistic     p.value
## 1 (Intercept) -3.541996852 1.9270750888 -1.838017 0.075359276
## 2        year  0.002005776 0.0009732225  2.060964 0.047511239
## 3 (Intercept) -5.294463740 1.9170812891 -2.761731 0.009444474
## 4        year  0.003065560 0.0009681753  3.166327 0.003381444
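In the combined output above, nothing records which rows came from which model. One way to keep that information, sketched here on invented data (toy, fit_a and fit_b are hypothetical), is to pass named arguments to bind_rows() together with its .id argument:

```r
library(dplyr)
library(broom)

# Invented yearly shares for two made-up countries
toy <- data.frame(
  Country     = rep(c("A", "B"), each = 4),
  year        = rep(c(1947, 1949, 1951, 1953), times = 2),
  percent_yes = c(0.20, 0.22, 0.25, 0.26, 0.60, 0.55, 0.52, 0.50)
)

fit_a <- lm(percent_yes ~ year, data = filter(toy, Country == "A"))
fit_b <- lm(percent_yes ~ year, data = filter(toy, Country == "B"))

# The argument names become the values of the new Country column
combined <- bind_rows(A = tidy(fit_a), B = tidy(fit_b), .id = "Country")
combined
```

This scales poorly beyond a handful of countries, which is the motivation for nesting and mapping over all countries at once.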

Nesting a Data Frame

Right now, the by_year_country data frame has one row per country-year pair. So that we can model each country individually, we’re going to “nest” all columns besides Country, which will result in a data frame with one row per country.

The data for each individual country will then be stored in a list column called data.

# Drop the grouping left over from summarize() before nesting
by_year_country <- by_year_country %>% ungroup()
# Nest all columns besides Country
nested <- nest(by_year_country, -Country)

Unnesting a Data Frame

The opposite of the nest() operation is the unnest() operation.

This takes each of the data frames in the list column and brings those rows back to the main data frame.

nested %>%
  unnest(data)
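A small round trip on invented data (toy is hypothetical) shows what the list column holds. The sketch uses the named form nest(data = -Country), which assumes tidyr 1.0.0 or later; the unnamed nest(-Country) form used in this document is the older spelling of the same operation.

```r
library(dplyr)
library(tidyr)

# Invented yearly shares for two made-up countries
toy <- data.frame(
  Country     = rep(c("A", "B"), each = 3),
  year        = rep(c(1947, 1949, 1951), times = 2),
  percent_yes = c(0.2, 0.3, 0.4, 0.6, 0.5, 0.4)
)

# One row per country; the other columns move into the list column "data"
toy_nested <- toy %>% nest(data = -Country)

# Each element of the list column is an ordinary data frame
toy_nested$data[[1]]

# unnest() restores the original one-row-per-country-year shape
toy_roundtrip <- toy_nested %>% unnest(data)
```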

Performing linear regression on each nested dataset

Now that we’ve divided the data for each country into a separate dataset in the data column, we need to fit a linear model to each of these datasets.

The map() function from purrr works by applying a function or formula to each item in a list, where . represents the individual item.

This means that to fit a model to each dataset, we can write:

map(data, ~ lm(percent_yes ~ year, data = .))

where . represents each individual data frame in the data column of the nested data frame.

# Perform a linear regression on each item in the data column
by_year_country %>%
  nest(-Country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy))

## # A tibble: 200 × 4
##              Country              data    model               tidied
##               <fctr>            <list>   <list>               <list>
## 1        Afghanistan <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 2            Albania <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 3            Algeria <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 4            Andorra <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 5             Angola <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 6  Antigua & Barbuda <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 7          Argentina <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 8            Armenia <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 9          Australia <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## 10           Austria <tibble [34 × 3]> <S3: lm> <data.frame [2 × 5]>
## # ... with 190 more rows
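The same pattern can be checked on invented data where the slopes are known in advance. This sketch (toy is hypothetical, and nest(data = -Country) assumes tidyr 1.0.0 or later) also uses map_dbl(), the purrr variant that returns a numeric vector instead of a list, to pull one number out of each model:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Country A rises by 0.05 per year, country B falls by 0.05 per year
toy <- data.frame(
  Country     = rep(c("A", "B"), each = 3),
  year        = rep(c(1947, 1949, 1951), times = 2),
  percent_yes = c(0.2, 0.3, 0.4, 0.6, 0.5, 0.4)
)

toy_models <- toy %>%
  nest(data = -Country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .x)),   # one lm per row
         slope = map_dbl(model, ~ coef(.x)[["year"]]))             # one number per model

toy_models %>% select(Country, slope)
```

Inside the formula, .x and . refer to the same thing; .x is used here only because it is easier to spot next to the data = argument.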

We now have a tidied version of each model stored in the tidied column. We want to combine all of those into one large data frame, similar to how we combined the UK and India tidied models earlier.

# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
  nest(-Country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

# Print the resulting country_coefficients variable
country_coefficients

## # A tibble: 400 × 6
##        Country        term      estimate   std.error statistic
##         <fctr>       <chr>         <dbl>       <dbl>     <dbl>
## 1  Afghanistan (Intercept)  -6.086817653 2.542917862 -2.393635
## 2  Afghanistan        year   0.003458754 0.001284239  2.693233
## 3      Albania (Intercept) -12.650311249 3.350250153 -3.775930
## 4      Albania        year   0.006603772 0.001691962  3.903025
## 5      Algeria (Intercept) -29.965445162 4.068425314 -7.365367
## 6      Algeria        year   0.015473937 0.002054659  7.531145
## 7      Andorra (Intercept) -23.043446352 3.002579013 -7.674551
## 8      Andorra        year   0.011734405 0.001516380  7.738435
## 9       Angola (Intercept) -31.119806208 4.027720266 -7.726407
## 10      Angola        year   0.015923866 0.002034102  7.828449
## # ... with 390 more rows, and 1 more variables: p.value <dbl>

Filtering for Slope and Significant Values

We currently have both the intercept and slope terms for each by-country model.

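We’re probably more interested in how each country’s voting is changing over time, so we want to focus on the slope terms.

Because we fit one model per country, we are testing about 200 slopes at once, and some p-values will fall below .05 by chance alone. Calling p.adjust(p.value) on the vector of p-values applies a multiple-testing correction (Holm’s method by default), which gives a set we can threshold with more confidence.

Here we’ll add two steps after filtering for the slope terms: a mutate to create the new, adjusted p-value column, and a filter to keep those below a .05 threshold.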

# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)

# Sort for the countries increasing most quickly
filtered_countries %>% arrange(desc(estimate))

# Sort for the countries decreasing most quickly
filtered_countries %>% arrange(estimate)
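To see what the adjustment does, here is a sketch on five invented p-values (p_raw is hypothetical). With the default Holm method, the smallest p-value is multiplied by the number of tests, the next by one fewer, and so on, with the results forced to be non-decreasing and capped at 1.

```r
# Five invented p-values, already sorted for readability
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.30)

# Default method is "holm": multipliers 5, 4, 3, 2, 1 applied in rank order
p_holm <- p.adjust(p_raw)
p_holm

# Four raw p-values are below .05, but only two survive the adjustment
sum(p_raw < 0.05)
sum(p_holm < 0.05)
```

The adjustment depends on how many p-values are passed in together, so it should be applied after filtering to the slope terms, as in the code above, and not to intercepts and slopes mixed.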

We have created the votes_processed dataset, containing information about each country’s votes. We’ll now combine that with the new descriptions dataset, which includes topic information about each resolution, so that we can analyze votes within particular topics.

To do this, we’ll make use of the inner_join() function from dplyr.

descriptions <- readRDS("descriptions.rds")
# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>% inner_join(descriptions, by = c("rcid", "session"))

Filtering the joined dataset

There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution:

1. me: Palestinian conflict

2. nu: Nuclear weapons and nuclear material

3. di: Arms control and disarmament

4. hr: Human rights

5. co: Colonialism

6. ec: Economic development

Each contains a 1 if the resolution is related to this topic and a 0 otherwise.

Filter for Votes related to Colonialism

# Summarize UK votes on colonialism by year
UK_co_by_year <- votes_joined %>%
  filter(Country == "United Kingdom", co == 1) %>%
  group_by(year) %>%
  summarize(percent_yes = mean(vote == 1))

# Graph the % of "yes" votes over time
ggplot(UK_co_by_year, aes(year, percent_yes)) +
  ggtitle("United Kingdom's voting pattern for topics related to Colonialism") +
  geom_line()

Using gather to tidy a dataset

In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic.

This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).

# Perform gather and filter
votes_gathered <- votes_joined %>% gather(topic, has_topic, me:ec) %>% filter(has_topic == 1)
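The reshape can be traced on two invented resolutions (toy is hypothetical and uses only three of the six topic columns). The sketch uses pivot_longer(), which assumes tidyr 1.0.0 or later and is the newer counterpart of gather() for this kind of wide-to-long step.

```r
library(dplyr)
library(tidyr)

# Resolution 1 is tagged with two topics, resolution 2 with one
toy <- data.frame(rcid = c(1, 2),
                  vote = c(1, 3),
                  me   = c(1, 0),
                  nu   = c(0, 0),
                  co   = c(1, 1))

toy_long <- toy %>%
  pivot_longer(me:co, names_to = "topic", values_to = "has_topic") %>%
  filter(has_topic == 1)

# rcid 1 appears twice (me and co), rcid 2 once (co)
toy_long
```

Two consequences follow from this shape: a vote tagged with several topics is counted once under each of them, and a vote tagged with no topic drops out entirely. Totals per topic therefore need not add up to the overall number of votes.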

Recoding the topics

There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:

1. me: Palestinian conflict

2. nu: Nuclear weapons and nuclear material

3. di: Arms control and disarmament

4. hr: Human rights

5. co: Colonialism

6. ec: Economic development

So that we can interpret the data more easily, recode the data to replace these codes with their full name.

# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
  mutate(topic = recode(topic,
                        me = "Palestinian conflict",
                        nu = "Nuclear weapons and nuclear material",
                        di = "Arms control and disarmament",
                        hr = "Human rights",
                        co = "Colonialism",
                        ec = "Economic development"))
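The same recoding can be expressed as a lookup in a named character vector, which needs only base R. This is a sketch of an alternative, not the method used above; topic_names and codes are hypothetical names.

```r
# Names are the two-letter codes, values are the full topic labels
topic_names <- c(me = "Palestinian conflict",
                 nu = "Nuclear weapons and nuclear material",
                 di = "Arms control and disarmament",
                 hr = "Human rights",
                 co = "Colonialism",
                 ec = "Economic development")

# Invented column of codes to translate
codes <- c("co", "me", "hr")

# Indexing by name returns the matching labels in the same order
full_names <- unname(topic_names[codes])
full_names
```

A code that is missing from the lookup comes back as NA with this approach, which makes unexpected codes easy to detect.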

Summarize by country, year, and topic

Now that we have topic as an additional variable, we can summarize the votes for each combination of country, year, and topic, e.g. for the United States in 2013 on the topic of nuclear weapons.

# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>%
  group_by(Country, year, topic) %>%
  summarize(total = n(), percent_yes = mean(vote == 1)) %>%
  ungroup()

Visualizing trends in topics for one country

You can now visualize the trends in percentage “yes” over time for all six topics side-by-side.

Here, we’ll visualize them just for the United Kingdom.

# Filter by_country_year_topic for just the UK
UK_by_country_year_topic <- by_country_year_topic %>% filter(Country == "United Kingdom")

# Plot % yes over time for the UK, faceting by topic
UK_by_country_year_topic %>% 
  ggplot(aes(x= year, y = percent_yes)) +
  geom_line() +
  ggtitle("United Kingdom's voting pattern for different topics over the years") +
  facet_wrap(~topic)

Linear models for each combination of country and topic

# Fit model on the by_country_year_topic dataset
country_topic_coefficients <-  by_country_year_topic %>%
  nest(-Country, -topic) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

# Print country_topic_coefficients
country_topic_coefficients

## # A tibble: 2,400 × 7
##        Country                        topic        term      estimate
##         <fctr>                        <chr>       <chr>         <dbl>
## 1  Afghanistan                  Colonialism (Intercept)  -2.781507722
## 2  Afghanistan                  Colonialism        year   0.001826514
## 3  Afghanistan         Economic development (Intercept) -10.965430258
## 4  Afghanistan         Economic development        year   0.005935676
## 5  Afghanistan                 Human rights (Intercept)  -3.171885627
## 6  Afghanistan                 Human rights        year   0.001985230
## 7  Afghanistan         Palestinian conflict (Intercept)  -7.552260810
## 8  Afghanistan         Palestinian conflict        year   0.004220446
## 9  Afghanistan Arms control and disarmament (Intercept) -10.355495881
## 10 Afghanistan Arms control and disarmament        year   0.005625035
## # ... with 2,390 more rows, and 3 more variables: std.error <dbl>,
## #   statistic <dbl>, p.value <dbl>

Filter only for slope terms with significant adjusted p-values

# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)

Visualization of the models

We found that over its history, Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.

Let’s examine this country’s voting patterns more closely. Recall that the by_country_year_topic dataset contained one row for each combination of country, year, and topic.

We can use that to create a plot of Vanuatu’s voting, faceted by topic.

# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>% filter(Country == "Vanuatu")

# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(x = year, y = percent_yes)) +
  geom_line() +
  ggtitle("Vanuatu's voting patterns for different topics over the years") +
  facet_wrap(~topic)
