Case Study: Exploratory Data Analysis in R

The United Nations Voting Dataset

Filtering rows

The vote column in the dataset has a number that represents that country’s vote:

1 = Yes
2 = Abstain
3 = No
8 = Not present
9 = Not a member

One step of data cleaning is removing observations (rows) that you’re not interested in. In this case, you want to remove “Not present” and “Not a member”.

# Load the dplyr package
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

votes <- readRDS("_data/votes.rds")

# Print the votes dataset
votes

## # A tibble: 508,929 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     9    31
##  4    46       2     1    40
##  5    46       2     1    41
##  6    46       2     1    42
##  7    46       2     9    51
##  8    46       2     9    52
##  9    46       2     9    53
## 10    46       2     9    54
## # ... with 508,919 more rows

# Filter for votes that are "yes", "abstain", or "no"
votes %>%
  filter(vote <= 3)

## # A tibble: 353,547 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     1    40
##  4    46       2     1    41
##  5    46       2     1    42
##  6    46       2     1    70
##  7    46       2     1    90
##  8    46       2     1    91
##  9    46       2     1    92
## 10    46       2     1    93
## # ... with 353,537 more rows

Adding a year column

The next step of data cleaning is manipulating your variables (columns) to make them more informative.

In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.

# Add another %>% step to add a year column
votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945)

## # A tibble: 353,547 x 5
##     rcid session  vote ccode  year
##    <dbl>   <dbl> <dbl> <int> <dbl>
##  1    46       2     1     2  1947
##  2    46       2     1    20  1947
##  3    46       2     1    40  1947
##  4    46       2     1    41  1947
##  5    46       2     1    42  1947
##  6    46       2     1    70  1947
##  7    46       2     1    90  1947
##  8    46       2     1    91  1947
##  9    46       2     1    92  1947
## 10    46       2     1    93  1947
## # ... with 353,537 more rows

Adding a country column

The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since you’d like to work with recognizable country names.

You can use the countrycode package to translate. For example:

library(countrycode)

# Translate the country code 2
> countrycode(2, "cown", "country.name")
[1] "United States"

# Translate multiple country codes
> countrycode(c(2, 20, 40), "cown", "country.name")
[1] "United States" "Canada"        "Cuba"

# Load the countrycode package
library(countrycode)

# Convert country code 100
countrycode(100, "cown", "country.name")

## [1] "Colombia"

# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))

## Warning: Problem with `mutate()` input `country`.
## i Some values were not matched unambiguously: 260
## 
## i Input `country` is `countrycode(ccode, "cown", "country.name")`.

## Warning in countrycode(ccode, "cown", "country.name"): Some values were not matched unambiguously: 260

Grouping and summarizing

Summarizing the full dataset

In this analysis, you’re going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.

You’ll start by finding this summary for the entire dataset: the fraction of all votes in their history that were “yes”. Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of “yes” votes.

# Print votes_processed
votes_processed

## # A tibble: 353,547 x 6
##     rcid session  vote ccode  year country           
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
##  1    46       2     1     2  1947 United States     
##  2    46       2     1    20  1947 Canada            
##  3    46       2     1    40  1947 Cuba              
##  4    46       2     1    41  1947 Haiti             
##  5    46       2     1    42  1947 Dominican Republic
##  6    46       2     1    70  1947 Mexico            
##  7    46       2     1    90  1947 Guatemala         
##  8    46       2     1    91  1947 Honduras          
##  9    46       2     1    92  1947 El Salvador       
## 10    46       2     1    93  1947 Nicaragua         
## # ... with 353,537 more rows

# Find total and fraction of "yes" votes
votes_processed %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## # A tibble: 1 x 2
##    total percent_yes
##    <int>       <dbl>
## 1 353547       0.800

Summarizing by year

The summarize() function is especially useful because it can be used within groups.

For example, you might like to know how much the average “agreeableness” of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.

# Change this code to summarize by year
votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 34 x 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1947  2039       0.569
##  2  1949  3469       0.438
##  3  1951  1434       0.585
##  4  1953  1537       0.632
##  5  1955  2169       0.695
##  6  1957  2708       0.609
##  7  1959  4326       0.588
##  8  1961  7482       0.573
##  9  1963  3308       0.729
## 10  1965  4382       0.708
## # ... with 24 more rows

Nice one! The group_by() function must go before your call to summarize() when you’re trying to perform your summary within groups.

Summarizing by country

In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.

# Summarize by country: by_country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` ungrouping output (override with `.groups` argument)

Sorting and filtering summarized data

Sorting by percentage of “yes” votes

Now that you’ve summarized the dataset by country, you can start examining it and answering interesting questions.

For example, you might be especially interested in the countries that voted “yes” least often, or the ones that voted “yes” most often.

# You have the votes summarized by country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` ungrouping output (override with `.groups` argument)

# Print the by_country dataset
by_country

## # A tibble: 200 x 3
##    country           total percent_yes
##    <chr>             <int>       <dbl>
##  1 Afghanistan        2373       0.859
##  2 Albania            1695       0.717
##  3 Algeria            2213       0.899
##  4 Andorra             719       0.638
##  5 Angola             1431       0.924
##  6 Antigua & Barbuda  1302       0.912
##  7 Argentina          2553       0.768
##  8 Armenia             758       0.747
##  9 Australia          2575       0.557
## 10 Austria            2389       0.622
## # ... with 190 more rows

# Sort in ascending order of percent_yes
by_country %>%
  arrange(percent_yes)

## # A tibble: 200 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 Zanzibar                             2       0    
##  2 United States                     2568       0.269
##  3 Palau                              369       0.339
##  4 Israel                            2380       0.341
##  5 <NA>                              1075       0.397
##  6 United Kingdom                    2558       0.417
##  7 France                            2527       0.427
##  8 Micronesia (Federated States of)   724       0.442
##  9 Marshall Islands                   757       0.491
## 10 Belgium                           2568       0.492
## # ... with 190 more rows

# Now sort in descending order
by_country %>%
  arrange(desc(percent_yes))

## # A tibble: 200 x 3
##    country              total percent_yes
##    <chr>                <int>       <dbl>
##  1 São Tomé & Príncipe   1091       0.976
##  2 Seychelles             881       0.975
##  3 Djibouti              1598       0.961
##  4 Guinea-Bissau         1538       0.960
##  5 Timor-Leste            326       0.957
##  6 Mauritius             1831       0.950
##  7 Zimbabwe              1361       0.949
##  8 Comoros               1133       0.947
##  9 United Arab Emirates  1934       0.947
## 10 Mozambique            1701       0.947
## # ... with 190 more rows

Filtering summarized output

In the last exercise, you may have noticed that the country that voted least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can’t make any substantial conclusions based on that data!

Typically in a progressive analysis, when you find that a few of your observations have very little data while others have plenty, you set some threshold to filter them out.

# Filter out countries with fewer than 100 votes
by_country %>%
  arrange(percent_yes) %>%
  filter(total >= 100)

## # A tibble: 197 x 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 United States                     2568       0.269
##  2 Palau                              369       0.339
##  3 Israel                            2380       0.341
##  4 <NA>                              1075       0.397
##  5 United Kingdom                    2558       0.417
##  6 France                            2527       0.427
##  7 Micronesia (Federated States of)   724       0.442
##  8 Marshall Islands                   757       0.491
##  9 Belgium                           2568       0.492
## 10 Canada                            2576       0.508
## # ... with 187 more rows

Visualization with ggplot2

Plotting a line over time

In the last chapter, you learned how to summarize() the votes dataset by year, particularly the percentage of votes in each year that were “yes”.

You’ll now use the ggplot2 package to turn your results into a visualization of the percentage of “yes” votes over time.

# Define by_year
by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` ungrouping output (override with `.groups` argument)

# Load the ggplot2 package
library(ggplot2)

# Create line plot
ggplot(by_year, aes(year, percent_yes)) +
  geom_line()

Other ggplot2 layers

A line plot is one way to display this data. You could also choose to display it as a scatter plot, with each year represented as a single point. This requires changing the layer (i.e. geom_line() to geom_point()).

You can also add additional layers to your graph, such as a smoothing curve with geom_smooth().

# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Visualizing by country

Summarizing by year and country

You’re more interested in trends of voting within specific countries than you are in the overall trend. So instead of summarizing just by year, summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.

# Group by year and country: by_year_country
by_year_country <- votes_processed %>%
  group_by(year, country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

Awesome! Let’s make some plots using this new dataset in the next exercise.

Plotting just the UK over time

Now that you have the percentage of time that each country voted “yes” within each year, you can plot the trend for a particular country. In this case, you’ll look at the trend for just the United Kingdom.

This will involve using filter() on your data before giving it to ggplot2.

# Start with by_year_country dataset
by_year_country <- votes_processed %>%
  group_by(year, country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

# Print by_year_country
by_year_country

## # A tibble: 4,744 x 4
## # Groups:   year [34]
##     year country     total percent_yes
##    <dbl> <chr>       <int>       <dbl>
##  1  1947 Afghanistan    34       0.382
##  2  1947 Argentina      38       0.579
##  3  1947 Australia      38       0.553
##  4  1947 Belarus        38       0.5  
##  5  1947 Belgium        38       0.605
##  6  1947 Bolivia        37       0.595
##  7  1947 Brazil         38       0.658
##  8  1947 Canada         38       0.605
##  9  1947 Chile          38       0.658
## 10  1947 Colombia       35       0.543
## # ... with 4,734 more rows

# Create a filtered version: UK_by_year
UK_by_year <- by_year_country %>%
  filter(country == "United Kingdom")

# Line plot of percent_yes over time for UK only
ggplot(UK_by_year, aes(year, percent_yes)) +
  geom_line()

Plotting multiple countries

Plotting just one country at a time is interesting, but you really want to compare trends between countries. For example, suppose you want to compare voting trends for the United States, the UK, France, and India.

You’ll have to filter to include all four of these countries and use another aesthetic (not just x- and y-axes) to distinguish the countries on the resulting visualization. Instead, you’ll use the color aesthetic to represent different countries.

# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
               "France", "India")

# Filter by_year_country: filtered_4_countries
filtered_4_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(year, percent_yes, color = country)) +
  geom_line()

Faceting by country

Faceting the time series

Now you’ll take a look at six countries. While in the previous exercise you used color to represent distinct countries, this gets a little too crowded with six.

Instead, you will facet, giving each country its own sub-plot. To do so, you add a facet_wrap() step after all of your layers.

# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India")

# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country)

Faceting with free y-axis

In the previous plot, all six graphs had the same axis limits. This made the changes over time hard to examine for plots with relatively little change.

Instead, you may want to let the plot choose a different y-axis for each facet.

# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India")

# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")

Choose your own countries The purpose of an exploratory data analysis is to ask questions and answer them with data. Now it’s your turn to ask the questions.

You’ll choose some countries whose history you are interested in and add them to the graph. If you want to look up the full list of countries, enter by_country$country in the console.

# Add three more countries to this list
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India",
               "Russia", "China", "North Korea")

# Filtered by_year_country: filtered_countries
filtered_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")

Linear regression

Linear regression on the United States

A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best fit line. It is done with the lm() function in R.

Here, you’ll fit a linear regression to just the percentage of “yes” votes from the United States.

# Percentage of yes votes from the US by year: US_by_year
US_by_year <- by_year_country %>%
  filter(country == "United States")

# Print the US_by_year data
US_by_year

## # A tibble: 34 x 4
## # Groups:   year [34]
##     year country       total percent_yes
##    <dbl> <chr>         <int>       <dbl>
##  1  1947 United States    38       0.711
##  2  1949 United States    64       0.281
##  3  1951 United States    25       0.4  
##  4  1953 United States    26       0.5  
##  5  1955 United States    37       0.622
##  6  1957 United States    34       0.647
##  7  1959 United States    54       0.426
##  8  1961 United States    75       0.507
##  9  1963 United States    32       0.5  
## 10  1965 United States    41       0.366
## # ... with 24 more rows

# Perform a linear regression of percent_yes by year: US_fit
US_fit <- lm(percent_yes ~ year, US_by_year)

# Perform summary() on the US_fit object
summary(US_fit)

## 
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222491 -0.080635 -0.008661  0.081948  0.194307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.6641455  1.8379743   6.890 8.48e-08 ***
## year        -0.0062393  0.0009282  -6.722 1.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared:  0.5854, Adjusted R-squared:  0.5724 
## F-statistic: 45.18 on 1 and 32 DF,  p-value: 1.367e-07

Tidying models with broom

Tidying a linear regression model

In the last section, you fit a linear model. Now, you’ll use the tidy() function in the broom package to turn that model into a tidy data frame.

# Load the broom package
library(broom)

# Call the tidy() function on the US_fit object
tidy(US_fit)

## # A tibble: 2 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) 12.7      1.84          6.89 0.0000000848
## 2 year        -0.00624  0.000928     -6.72 0.000000137

Combining models for multiple countries

One important advantage of changing models to tidied data frames is that they can be combined.

In an earlier section, you fit a linear model to the percentage of “yes” votes for each year in the United States. Now you’ll fit the same model for the United Kingdom and combine the results from both countries.

# Linear regression of percent_yes by year for US
US_by_year <- by_year_country %>%
  filter(country == "United States")
US_fit <- lm(percent_yes ~ year, US_by_year)

# Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
  filter(country == "United Kingdom")
UK_fit <- lm(percent_yes ~ year, UK_by_year)

# Create US_tidied and UK_tidied
US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)

# Combine the two tidied models
bind_rows(US_tidied, UK_tidied)

## # A tibble: 4 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) 12.7      1.84          6.89 0.0000000848
## 2 year        -0.00624  0.000928     -6.72 0.000000137 
## 3 (Intercept) -3.27     1.96         -1.67 0.105       
## 4 year         0.00187  0.000989      1.89 0.0677

Awesome! We can easily compare the two models now.

Nesting for multiple models

Nesting a data frame

Right now, the by_year_country data frame has one row per country-vote pair. So that you can model each country individually, you’re going to “nest” all columns besides country, which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.

# Load the tidyr package
library(tidyr)

# Nest all columns besides country
by_year_country %>%
  nest(-country)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 x 2
##    country     data             
##    <chr>       <list>           
##  1 Afghanistan <tibble [34 x 3]>
##  2 Argentina   <tibble [34 x 3]>
##  3 Australia   <tibble [34 x 3]>
##  4 Belarus     <tibble [34 x 3]>
##  5 Belgium     <tibble [34 x 3]>
##  6 Bolivia     <tibble [34 x 3]>
##  7 Brazil      <tibble [34 x 3]>
##  8 Canada      <tibble [34 x 3]>
##  9 Chile       <tibble [34 x 3]>
## 10 Colombia    <tibble [34 x 3]>
## # ... with 190 more rows

List columns

This “nested” data has an interesting structure. The second column, data, is a list, a type of R object that hasn’t yet come up in this course that allows complicated objects to be stored within each row. This is because each item of the data column is itself a data frame.

# A tibble: 200 × 2
                           country              data
                             <chr>            <list>
1                      Afghanistan <tibble [34 × 3]>
2                        Argentina <tibble [34 × 3]>
3                        Australia <tibble [34 × 3]>
4                          Belarus <tibble [34 × 3]>
5                          Belgium <tibble [34 × 3]>
6  Bolivia, Plurinational State of <tibble [34 × 3]>
7                           Brazil <tibble [34 × 3]>
8                           Canada <tibble [34 × 3]>
9                            Chile <tibble [34 × 3]>
10                        Colombia <tibble [34 × 3]>

You can use nested$data to access this list column and double brackets to access a particular element. For example, nested$data[[1]] would give you the data frame with Afghanistan’s voting history (thepercent_yes per year), since Afghanistan is the first row of the table.

# All countries are nested besides country
nested <- by_year_country %>%
  nest(-country)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

# Print the nested data for Brazil
nested$data[[7]]

## # A tibble: 34 x 3
## # Groups:   year [34]
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1947    38       0.658
##  2  1949    64       0.469
##  3  1951    25       0.64 
##  4  1953    26       0.731
##  5  1955    37       0.730
##  6  1957    34       0.735
##  7  1959    54       0.537
##  8  1961    76       0.553
##  9  1963    32       0.781
## 10  1965    41       0.610
## # ... with 24 more rows

Unnesting

The opposite of the nest() operation is the unnest() operation. This takes each of the data frames in the list column and brings those rows back to the main data frame.

In this exercise, you are just undoing the nest() operation. In the next section, you’ll learn how to fit a model in between these nesting and unnesting steps that makes this process useful.

# All countries are nested besides country
nested <- by_year_country %>%
  nest(-country)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

# Unnest the data column to return it to its original form
nested %>%
  unnest(data)

## # A tibble: 4,744 x 4
##    country      year total percent_yes
##    <chr>       <dbl> <int>       <dbl>
##  1 Afghanistan  1947    34       0.382
##  2 Afghanistan  1949    51       0.608
##  3 Afghanistan  1951    25       0.76 
##  4 Afghanistan  1953    26       0.769
##  5 Afghanistan  1955    37       0.730
##  6 Afghanistan  1957    34       0.529
##  7 Afghanistan  1959    54       0.611
##  8 Afghanistan  1961    76       0.605
##  9 Afghanistan  1963    32       0.781
## 10 Afghanistan  1965    40       0.85 
## # ... with 4,734 more rows

Fitting multiple models

map() applies an operation to each item in a list

Performing linear regression on each nested dataset

Now that you’ve divided the data for each country into a separate dataset in the data column, you need to fit a linear model to each of these datasets.

The map() function from purrr works by applying a formula to each item in a list, where . represents the individual item. For example, you could add one to each of a list of numbers:

map(numbers, ~ 1 + .)

This means that to fit a model to each dataset, you can do:

map(data, ~ lm(percent_yes ~ year, data = .))

where . represents each individual item from the data column in by_year_country. Recall that each item in the data column is a dataset that pertains to a specific country.

# Load tidyr and purrr
library(tidyr)
library(purrr)

# Perform a linear regression on each item in the data column
by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 x 3
##    country     data              model 
##    <chr>       <list>            <list>
##  1 Afghanistan <tibble [34 x 3]> <lm>  
##  2 Argentina   <tibble [34 x 3]> <lm>  
##  3 Australia   <tibble [34 x 3]> <lm>  
##  4 Belarus     <tibble [34 x 3]> <lm>  
##  5 Belgium     <tibble [34 x 3]> <lm>  
##  6 Bolivia     <tibble [34 x 3]> <lm>  
##  7 Brazil      <tibble [34 x 3]> <lm>  
##  8 Canada      <tibble [34 x 3]> <lm>  
##  9 Chile       <tibble [34 x 3]> <lm>  
## 10 Colombia    <tibble [34 x 3]> <lm>  
## # ... with 190 more rows

Tidy each linear regression model

You’ve now performed a linear regression on each nested dataset and have a linear model stored in the list column model. But you can’t recombine the models until you’ve tidied each into a table of coefficients. To do that, you’ll need to use map() one more time and the tidy() function from the broom package.

Recall that you can simply give a function to map() (e.g. map(models, tidy)) in order to apply that function to each item of a list.

# Load the broom package
library(broom)

# Add another mutate that applies tidy() to each model
by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy))

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 x 4
##    country     data              model  tidied          
##    <chr>       <list>            <list> <list>          
##  1 Afghanistan <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  2 Argentina   <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  3 Australia   <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  4 Belarus     <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  5 Belgium     <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  6 Bolivia     <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  7 Brazil      <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  8 Canada      <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
##  9 Chile       <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
## 10 Colombia    <tibble [34 x 3]> <lm>   <tibble [2 x 5]>
## # ... with 190 more rows

Unnesting a data frame

You now have a tidied version of each model stored in the tidied column. You want to combine all of those into a large data frame, similar to how you combined the US and UK tidied models earlier. Recall that the unnest() function from tidyr achieves this.

# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

# Print the resulting country_coefficients variable
country_coefficients

## # A tibble: 400 x 8
##    country   data        model  term     estimate std.error statistic    p.value
##    <chr>     <list>      <list> <chr>       <dbl>     <dbl>     <dbl>      <dbl>
##  1 Afghanis~ <tibble [3~ <lm>   (Interc~ -1.11e+1  1.47         -7.52    1.44e-8
##  2 Afghanis~ <tibble [3~ <lm>   year      6.01e-3  0.000743      8.09    3.06e-9
##  3 Argentina <tibble [3~ <lm>   (Interc~ -9.46e+0  2.10         -4.50    8.32e-5
##  4 Argentina <tibble [3~ <lm>   year      5.15e-3  0.00106       4.85    3.05e-5
##  5 Australia <tibble [3~ <lm>   (Interc~ -4.55e+0  2.15         -2.12    4.22e-2
##  6 Australia <tibble [3~ <lm>   year      2.57e-3  0.00108       2.37    2.42e-2
##  7 Belarus   <tibble [3~ <lm>   (Interc~ -7.00e+0  1.50         -4.66    5.33e-5
##  8 Belarus   <tibble [3~ <lm>   year      3.91e-3  0.000759      5.15    1.28e-5
##  9 Belgium   <tibble [3~ <lm>   (Interc~ -5.85e+0  1.52         -3.86    5.22e-4
## 10 Belgium   <tibble [3~ <lm>   year      3.20e-3  0.000765      4.19    2.07e-4
## # ... with 390 more rows

Working with many tidy models

Filtering model terms

You currently have both the intercept and slope terms for each by-country model. You’re probably more interested in how each is changing over time, so you want to focus on the slope terms.

# Print the country_coefficients dataset
country_coefficients

## # A tibble: 400 x 8
##    country   data        model  term     estimate std.error statistic    p.value
##    <chr>     <list>      <list> <chr>       <dbl>     <dbl>     <dbl>      <dbl>
##  1 Afghanis~ <tibble [3~ <lm>   (Interc~ -1.11e+1  1.47         -7.52    1.44e-8
##  2 Afghanis~ <tibble [3~ <lm>   year      6.01e-3  0.000743      8.09    3.06e-9
##  3 Argentina <tibble [3~ <lm>   (Interc~ -9.46e+0  2.10         -4.50    8.32e-5
##  4 Argentina <tibble [3~ <lm>   year      5.15e-3  0.00106       4.85    3.05e-5
##  5 Australia <tibble [3~ <lm>   (Interc~ -4.55e+0  2.15         -2.12    4.22e-2
##  6 Australia <tibble [3~ <lm>   year      2.57e-3  0.00108       2.37    2.42e-2
##  7 Belarus   <tibble [3~ <lm>   (Interc~ -7.00e+0  1.50         -4.66    5.33e-5
##  8 Belarus   <tibble [3~ <lm>   year      3.91e-3  0.000759      5.15    1.28e-5
##  9 Belgium   <tibble [3~ <lm>   (Interc~ -5.85e+0  1.52         -3.86    5.22e-4
## 10 Belgium   <tibble [3~ <lm>   year      3.20e-3  0.000765      4.19    2.07e-4
## # ... with 390 more rows

# Filter for only the slope terms
country_coefficients %>%
  filter(term == "year")

## # A tibble: 200 x 8
##    country    data          model  term  estimate std.error statistic    p.value
##    <chr>      <list>        <list> <chr>    <dbl>     <dbl>     <dbl>      <dbl>
##  1 Afghanist~ <tibble [34 ~ <lm>   year   0.00601  0.000743      8.09    3.06e-9
##  2 Argentina  <tibble [34 ~ <lm>   year   0.00515  0.00106       4.85    3.05e-5
##  3 Australia  <tibble [34 ~ <lm>   year   0.00257  0.00108       2.37    2.42e-2
##  4 Belarus    <tibble [34 ~ <lm>   year   0.00391  0.000759      5.15    1.28e-5
##  5 Belgium    <tibble [34 ~ <lm>   year   0.00320  0.000765      4.19    2.07e-4
##  6 Bolivia    <tibble [34 ~ <lm>   year   0.00580  0.000966      6.01    1.06e-6
##  7 Brazil     <tibble [34 ~ <lm>   year   0.00611  0.000817      7.48    1.64e-8
##  8 Canada     <tibble [34 ~ <lm>   year   0.00152  0.000955      1.59    1.22e-1
##  9 Chile      <tibble [34 ~ <lm>   year   0.00678  0.000822      8.24    2.05e-9
## 10 Colombia   <tibble [34 ~ <lm>   year   0.00616  0.000965      6.38    3.58e-7
## # ... with 190 more rows

Filtering for significant countries

Not all slopes are significant, and you can use the p-value to guess which are and which are not.

However, when you have lots of p-values, like one for each country, you run into the problem of multiple hypothesis testing, where you have to set a stricter threshold. The p.adjust() function is a simple way to correct for this, where p.adjust(p.value) on a vector of p-values returns a set that you can trust.

Here you’ll add two steps to process the slope_terms dataset: use a mutate to create the new, adjusted p-value column, and filter to filter for those below a .05 threshold.

# Filter for only the slope terms
slope_terms <- country_coefficients %>%
  filter(term == "year")

# Add p.adjusted column, then filter
slope_terms %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)

## # A tibble: 61 x 9
##    country  data    model term  estimate std.error statistic  p.value p.adjusted
##    <chr>    <list>  <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
##  1 Afghani~ <tibbl~ <lm>  year   0.00601  0.000743      8.09  3.06e-9    5.95e-7
##  2 Argenti~ <tibbl~ <lm>  year   0.00515  0.00106       4.85  3.05e-5    4.81e-3
##  3 Belarus  <tibbl~ <lm>  year   0.00391  0.000759      5.15  1.28e-5    2.08e-3
##  4 Belgium  <tibbl~ <lm>  year   0.00320  0.000765      4.19  2.07e-4    3.01e-2
##  5 Bolivia  <tibbl~ <lm>  year   0.00580  0.000966      6.01  1.06e-6    1.88e-4
##  6 Brazil   <tibbl~ <lm>  year   0.00611  0.000817      7.48  1.64e-8    3.12e-6
##  7 Chile    <tibbl~ <lm>  year   0.00678  0.000822      8.24  2.05e-9    3.99e-7
##  8 Colombia <tibbl~ <lm>  year   0.00616  0.000965      6.38  3.58e-7    6.56e-5
##  9 Costa R~ <tibbl~ <lm>  year   0.00654  0.000812      8.05  3.39e-9    6.54e-7
## 10 Cuba     <tibbl~ <lm>  year   0.00461  0.000721      6.40  3.43e-7    6.31e-5
## # ... with 51 more rows

Great work! Notice that there are now only 61 countries with significant trends.

Sorting by slope

Now that you’ve filtered for countries where the trend is probably not due to chance, you may be interested in countries whose percentage of “yes” votes is changing most quickly over time. Thus, you want to find the countries with the highest and lowest slopes; that is, the estimate column.

# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)

# Sort for the countries increasing most quickly
filtered_countries %>%
  arrange(desc(estimate))

## # A tibble: 61 x 9
##    country   data   model term  estimate std.error statistic  p.value p.adjusted
##    <chr>     <list> <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
##  1 South Af~ <tibb~ <lm>  year   0.0119   0.00140       8.47 1.60e- 8    3.05e-6
##  2 Kazakhst~ <tibb~ <lm>  year   0.0110   0.00195       5.62 3.24e- 4    4.51e-2
##  3 Yemen Ar~ <tibb~ <lm>  year   0.0109   0.00159       6.84 1.20e- 6    2.11e-4
##  4 Kyrgyzst~ <tibb~ <lm>  year   0.00973  0.000988      9.84 2.38e- 5    3.78e-3
##  5 Malawi    <tibb~ <lm>  year   0.00908  0.00181       5.02 4.48e- 5    7.03e-3
##  6 Dominica~ <tibb~ <lm>  year   0.00806  0.000914      8.81 5.96e-10    1.17e-7
##  7 Portugal  <tibb~ <lm>  year   0.00802  0.00171       4.68 7.13e- 5    1.10e-2
##  8 Honduras  <tibb~ <lm>  year   0.00772  0.000921      8.38 1.43e- 9    2.81e-7
##  9 Peru      <tibb~ <lm>  year   0.00730  0.000976      7.48 1.65e- 8    3.12e-6
## 10 Nicaragua <tibb~ <lm>  year   0.00708  0.00107       6.60 1.92e- 7    3.55e-5
## # ... with 51 more rows

# Sort for the countries decreasing most quickly
filtered_countries %>%
  arrange(estimate)

## # A tibble: 61 x 9
##    country   data    model term  estimate std.error statistic p.value p.adjusted
##    <chr>     <list>  <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
##  1 South Ko~ <tibbl~ <lm>  year  -0.00921  0.00155      -5.96 1.39e-4  0.0209   
##  2 Israel    <tibbl~ <lm>  year  -0.00685  0.00117      -5.85 1.89e-6  0.000331 
##  3 United S~ <tibbl~ <lm>  year  -0.00624  0.000928     -6.72 1.37e-7  0.0000254
##  4 Belgium   <tibbl~ <lm>  year   0.00320  0.000765      4.19 2.07e-4  0.0301   
##  5 Guinea    <tibbl~ <lm>  year   0.00362  0.000833      4.35 1.87e-4  0.0275   
##  6 Morocco   <tibbl~ <lm>  year   0.00380  0.000860      4.42 1.46e-4  0.0218   
##  7 Belarus   <tibbl~ <lm>  year   0.00391  0.000759      5.15 1.28e-5  0.00208  
##  8 Iran      <tibbl~ <lm>  year   0.00391  0.000856      4.57 6.91e-5  0.0107   
##  9 Congo - ~ <tibbl~ <lm>  year   0.00397  0.000922      4.30 2.27e-4  0.0326   
## 10 Sudan     <tibbl~ <lm>  year   0.00399  0.000961      4.15 2.98e-4  0.0420   
## # ... with 51 more rows

Joining datasets

Joining datasets with inner_join

In the first chapter, you created the votes_processed dataset, containing information about each country’s votes. You’ll now combine that with the new descriptions dataset, which includes topic information about each country, so that you can analyze votes within particular topics.

To do this, you’ll make use of the inner_join() function from dplyr.

# Load dplyr package
library(dplyr)
descriptions <- readRDS("_data/descriptions.rds")

# Print the votes_processed dataset
votes_processed

## # A tibble: 353,547 x 6
##     rcid session  vote ccode  year country           
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
##  1    46       2     1     2  1947 United States     
##  2    46       2     1    20  1947 Canada            
##  3    46       2     1    40  1947 Cuba              
##  4    46       2     1    41  1947 Haiti             
##  5    46       2     1    42  1947 Dominican Republic
##  6    46       2     1    70  1947 Mexico            
##  7    46       2     1    90  1947 Guatemala         
##  8    46       2     1    91  1947 Honduras          
##  9    46       2     1    92  1947 El Salvador       
## 10    46       2     1    93  1947 Nicaragua         
## # ... with 353,537 more rows

# Print the descriptions dataset
descriptions

## # A tibble: 2,589 x 10
##     rcid session date                unres      me    nu    di    hr    co    ec
##    <dbl>   <dbl> <dttm>              <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    46       2 1947-09-04 00:00:00 R/2/299     0     0     0     0     0     0
##  2    47       2 1947-10-05 00:00:00 R/2/355     0     0     0     1     0     0
##  3    48       2 1947-10-06 00:00:00 R/2/461     0     0     0     0     0     0
##  4    49       2 1947-10-06 00:00:00 R/2/463     0     0     0     0     0     0
##  5    50       2 1947-10-06 00:00:00 R/2/465     0     0     0     0     0     0
##  6    51       2 1947-10-02 00:00:00 R/2/561     0     0     0     0     1     0
##  7    52       2 1947-11-06 00:00:00 R/2/650     0     0     0     0     1     0
##  8    53       2 1947-11-06 00:00:00 R/2/651     0     0     0     0     1     0
##  9    54       2 1947-11-06 00:00:00 R/2/651     0     0     0     0     1     0
## 10    55       2 1947-11-06 00:00:00 R/2/667     0     0     0     0     1     0
## # ... with 2,579 more rows

# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>%
  inner_join(descriptions, by = c("rcid", "session"))

Filtering the joined dataset

There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution:

me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development

Each contains a 1 if the resolution is related to this topic and a 0 otherwise.

# Filter for votes related to colonialism
votes_joined %>%
  filter(co == 1)

## # A tibble: 60,962 x 14
##     rcid session  vote ccode  year country date                unres    me    nu
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>   <dttm>              <chr> <dbl> <dbl>
##  1    51       2     3     2  1947 United~ 1947-10-02 00:00:00 R/2/~     0     0
##  2    51       2     3    20  1947 Canada  1947-10-02 00:00:00 R/2/~     0     0
##  3    51       2     2    40  1947 Cuba    1947-10-02 00:00:00 R/2/~     0     0
##  4    51       2     1    41  1947 Haiti   1947-10-02 00:00:00 R/2/~     0     0
##  5    51       2     3    42  1947 Domini~ 1947-10-02 00:00:00 R/2/~     0     0
##  6    51       2     2    70  1947 Mexico  1947-10-02 00:00:00 R/2/~     0     0
##  7    51       2     2    90  1947 Guatem~ 1947-10-02 00:00:00 R/2/~     0     0
##  8    51       2     2    92  1947 El Sal~ 1947-10-02 00:00:00 R/2/~     0     0
##  9    51       2     3    93  1947 Nicara~ 1947-10-02 00:00:00 R/2/~     0     0
## 10    51       2     2    95  1947 Panama  1947-10-02 00:00:00 R/2/~     0     0
## # ... with 60,952 more rows, and 4 more variables: di <dbl>, hr <dbl>,
## #   co <dbl>, ec <dbl>

Visualizing colonialism votes

In an earlier exercise, you graphed the percentage of votes each year where the US voted “yes”. Now you’ll create that same graph, but only for votes related to colonialism.

# Load the ggplot2 package
library(ggplot2)

# Filter, then summarize by year: US_co_by_year
US_co_by_year <- votes_joined %>%
  filter(country == "United States", co == 1) %>%
  group_by(year) %>%
  summarize(percent_yes = mean(vote == 1))

## `summarise()` ungrouping output (override with `.groups` argument)

# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(year, percent_yes)) +
  geom_line()

Tidy data

Using gather to tidy a dataset

In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic. This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).

# Load the tidyr package
library(tidyr)

# Gather the six me/nu/di/hr/co/ec columns
votes_joined %>%
  gather(topic, has_topic, me:ec)

## # A tibble: 2,121,282 x 10
##     rcid session  vote ccode  year country date                unres topic
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>   <dttm>              <chr> <chr>
##  1    46       2     1     2  1947 United~ 1947-09-04 00:00:00 R/2/~ me   
##  2    46       2     1    20  1947 Canada  1947-09-04 00:00:00 R/2/~ me   
##  3    46       2     1    40  1947 Cuba    1947-09-04 00:00:00 R/2/~ me   
##  4    46       2     1    41  1947 Haiti   1947-09-04 00:00:00 R/2/~ me   
##  5    46       2     1    42  1947 Domini~ 1947-09-04 00:00:00 R/2/~ me   
##  6    46       2     1    70  1947 Mexico  1947-09-04 00:00:00 R/2/~ me   
##  7    46       2     1    90  1947 Guatem~ 1947-09-04 00:00:00 R/2/~ me   
##  8    46       2     1    91  1947 Hondur~ 1947-09-04 00:00:00 R/2/~ me   
##  9    46       2     1    92  1947 El Sal~ 1947-09-04 00:00:00 R/2/~ me   
## 10    46       2     1    93  1947 Nicara~ 1947-09-04 00:00:00 R/2/~ me   
## # ... with 2,121,272 more rows, and 1 more variable: has_topic <dbl>

# Perform gather again, then filter
votes_gathered <- votes_joined %>%
  gather(topic, has_topic, me:ec) %>%
  filter(has_topic == 1)

Recoding the topics

There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:

me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development

So that you can interpret the data more easily, recode the data to replace these codes with their full name. You can do that with dplyr’s recode() function, which replaces values with ones you specify:

example <- c("apple", "banana", "apple", "orange")
recode(example,
       apple = "plum",
       banana = "grape")

# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
  mutate(topic = recode(topic,
                        me = "Palestinian conflict",
                        nu = "Nuclear weapons and nuclear material",
                        di = "Arms control and disarmament",
                        hr = "Human rights",
                        co = "Colonialism",
                        ec = "Economic development"))

Summarize by country, year, and topic

In previous exercises, you summarized the votes dataset by country, by year, and by country-year combination.

Now that you have topic as an additional variable, you can summarize the votes for each combination of country, year, and topic (e.g. for the United States in 2013 on the topic of nuclear weapons.)

# Print votes_tidied
votes_tidied

## # A tibble: 350,032 x 10
##     rcid session  vote ccode  year country date                unres topic
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>   <dttm>              <chr> <chr>
##  1    77       2     1     2  1947 United~ 1947-11-06 00:00:00 R/2/~ Pale~
##  2    77       2     1    20  1947 Canada  1947-11-06 00:00:00 R/2/~ Pale~
##  3    77       2     3    40  1947 Cuba    1947-11-06 00:00:00 R/2/~ Pale~
##  4    77       2     1    41  1947 Haiti   1947-11-06 00:00:00 R/2/~ Pale~
##  5    77       2     1    42  1947 Domini~ 1947-11-06 00:00:00 R/2/~ Pale~
##  6    77       2     2    70  1947 Mexico  1947-11-06 00:00:00 R/2/~ Pale~
##  7    77       2     1    90  1947 Guatem~ 1947-11-06 00:00:00 R/2/~ Pale~
##  8    77       2     2    91  1947 Hondur~ 1947-11-06 00:00:00 R/2/~ Pale~
##  9    77       2     2    92  1947 El Sal~ 1947-11-06 00:00:00 R/2/~ Pale~
## 10    77       2     1    93  1947 Nicara~ 1947-11-06 00:00:00 R/2/~ Pale~
## # ... with 350,022 more rows, and 1 more variable: has_topic <dbl>

# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>%
  group_by(country, year, topic) %>%
  summarize(total = n(), percent_yes = mean(vote == 1)) %>%
  ungroup()

## `summarise()` regrouping output by 'country', 'year' (override with `.groups` argument)

# Print by_country_year_topic
by_country_year_topic

## # A tibble: 26,968 x 5
##    country      year topic                                total percent_yes
##    <chr>       <dbl> <chr>                                <int>       <dbl>
##  1 Afghanistan  1947 Colonialism                              8       0.5  
##  2 Afghanistan  1947 Economic development                     1       0    
##  3 Afghanistan  1947 Human rights                             1       0    
##  4 Afghanistan  1947 Palestinian conflict                     6       0    
##  5 Afghanistan  1949 Arms control and disarmament             3       0    
##  6 Afghanistan  1949 Colonialism                             22       0.864
##  7 Afghanistan  1949 Economic development                     1       1    
##  8 Afghanistan  1949 Human rights                             3       0    
##  9 Afghanistan  1949 Nuclear weapons and nuclear material     3       0    
## 10 Afghanistan  1949 Palestinian conflict                    11       0.818
## # ... with 26,958 more rows

Visualizing trends in topics for one country

You can now visualize the trends in percentage “yes” over time for all six topics side-by-side. Here, you’ll visualize them just for the United States.

# Load the ggplot2 package
library(ggplot2)

# Filter by_country_year_topic for just the US
US_by_country_year_topic <- by_country_year_topic %>%
  filter(country == "United States")

# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)

Tidy modeling by topic and country

Nesting by topic and country

In the last chapter, you constructed a linear model for each country by nesting the data in each country, fitting a model to each dataset, then tidying each model with broom and unnesting the coefficients. The code looked something like this:

country_coefficients <- by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

Now, you’ll again be modeling change in “percentage” yes over time, but instead of fitting one model for each country, you’ll fit one for each combination of country and topic.

# Load purrr, tidyr, and broom
library(purrr)
library(tidyr)
library(broom)

# Print by_country_year_topic
by_country_year_topic

## # A tibble: 26,968 x 5
##    country      year topic                                total percent_yes
##    <chr>       <dbl> <chr>                                <int>       <dbl>
##  1 Afghanistan  1947 Colonialism                              8       0.5  
##  2 Afghanistan  1947 Economic development                     1       0    
##  3 Afghanistan  1947 Human rights                             1       0    
##  4 Afghanistan  1947 Palestinian conflict                     6       0    
##  5 Afghanistan  1949 Arms control and disarmament             3       0    
##  6 Afghanistan  1949 Colonialism                             22       0.864
##  7 Afghanistan  1949 Economic development                     1       1    
##  8 Afghanistan  1949 Human rights                             3       0    
##  9 Afghanistan  1949 Nuclear weapons and nuclear material     3       0    
## 10 Afghanistan  1949 Palestinian conflict                    11       0.818
## # ... with 26,958 more rows

# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
  nest(-country, -topic) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning: Problem with `mutate()` input `tidied`.
## i essentially perfect fit: summary may be unreliable
## i Input `tidied` is `map(model, tidy)`.

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

# Print country_topic_coefficients
country_topic_coefficients

## # A tibble: 2,384 x 9
##    country  topic      data    model term   estimate std.error statistic p.value
##    <chr>    <chr>      <list>  <lis> <chr>     <dbl>     <dbl>     <dbl>   <dbl>
##  1 Afghani~ Coloniali~ <tibbl~ <lm>  (Inte~ -9.20e+0  1.96         -4.70 4.76e-5
##  2 Afghani~ Coloniali~ <tibbl~ <lm>  year    5.11e-3  0.000989      5.17 1.23e-5
##  3 Afghani~ Economic ~ <tibbl~ <lm>  (Inte~ -1.15e+1  3.62         -3.17 3.49e-3
##  4 Afghani~ Economic ~ <tibbl~ <lm>  year    6.24e-3  0.00183       3.42 1.85e-3
##  5 Afghani~ Human rig~ <tibbl~ <lm>  (Inte~ -7.27e+0  4.37         -1.66 1.06e-1
##  6 Afghani~ Human rig~ <tibbl~ <lm>  year    4.08e-3  0.00221       1.85 7.43e-2
##  7 Afghani~ Palestini~ <tibbl~ <lm>  (Inte~ -1.33e+1  3.57         -3.73 8.66e-4
##  8 Afghani~ Palestini~ <tibbl~ <lm>  year    7.17e-3  0.00180       3.98 4.42e-4
##  9 Afghani~ Arms cont~ <tibbl~ <lm>  (Inte~ -1.38e+1  4.13         -3.33 2.53e-3
## 10 Afghani~ Arms cont~ <tibbl~ <lm>  year    7.37e-3  0.00208       3.54 1.49e-3
## # ... with 2,374 more rows

Great work! You can ignore the warning messages in the console for now.

Interpreting tidy models

Now you have both the slope and intercept terms for each model. Just as you did in the last chapter with the tidied coefficients, you’ll need to filter for only the slope terms.

You’ll also have to extract only cases that are statistically significant, which means adjusting the p-value for the number of models, and then filtering to include only significant changes.

# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)

Checking models visually

Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.

Let’s examine this country’s voting patterns more closely. Recall that the by_country_year_topic dataset contained one row for each combination of country, year, and topic. You can use that to create a plot of Vanuatu’s voting, faceted by topic.

# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>%
  filter(country == "Vanuatu")

# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)

Phenomenal work! Congratulations on finishing the course!