Exploratory Data Analysis in R

Libraries

library(tidyverse)
library(openintro)
library(assertive)
library(fst)
library(broom)
library(ggfortify)
library(yardstick)
library(moderndive)
library(plot3D)
library(magrittr)
library(infer)
library(anytime)
library(lubridate)
library(ggridges)

Data

cars<- read.csv("cars.csv")
comics <- read.csv("comics.csv")
immigrant <- read.csv("immigrant.csv")
life <- read.csv("life.csv")
names <- read.csv("names.csv")
income <- read.csv("income.csv")
votes <- read_rds("votes.rds")
descriptions<- read_rds("descriptions.rds")

0. Course Description

When your dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.

1. Exploring Categorical Data

In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.

1.1 Exploring Categorical Data

In this chapter you’ll continue working with the comics dataset introduced in the video. This is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.

Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.

# Check levels of align
levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

# Check the levels of gender
levels(comics$gender)

## [1] "1997, December" "Female"         "Male"           "Other"

which(grepl("1997, December", comics$gender))

## [1] 20261

# Create a 2-way contingency table
table(comics$align, comics$gender)

##                     
##                      1997, December Female Male Other
##   Bad                             0   1550 7497    32
##   Good                            0   2472 4786    17
##   Neutral                         0    831 1794    17
##   Reformed Criminals              0      1    2     0

tab <- table(comics$align, comics$gender)
tab <- tab[ ,2:4]
tab

##                     
##                      Female Male Other
##   Bad                  1550 7497    32
##   Good                 2472 4786    17
##   Neutral               831 1794    17
##   Reformed Criminals      1    2     0

1.2 Dropping levels

The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In R, this requires two steps: first filtering out any rows with the levels that have very low counts, then removing these levels from the factor variable with droplevels(). This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.

# Remove align level
comics_filtered <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

# See the result
head(comics_filtered)

##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

1.3 Side-by-side barcharts

While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.This shows that there can often be two or more options for presenting the same data. Passing the argument position = “dodge” to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

1.4 Counts vs. proportions

The following code generates tables of joint and conditional proportions, respectively:

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") +
  ylab("proportion")

1.5 Marginal barchart

If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable. You can improve the interpretability of the plot, though, by implementing some sensible ordering.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

2. Exploring Numerical Data

In this chapter, you will learn how to graphically summarize numerical data.

2.1 Faceted histogram

In this chapter, you’ll be working with the cars dataset, which records characteristics on all of the new models of cars for sale in the US in a certain year. You will investigate the distribution of mileage across a categorical variable, but before you get there, you’ll want to familiarize yourself with the dataset.

# Learn data structure
str(cars)

## 'data.frame':    428 obs. of  20 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

2.2 Boxplots and density plots

The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you’ll try your hand at two alternatives: the box plot and the density plot.

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

## Warning: Removed 11 rows containing non-finite values (stat_density).

2.3 Marginal and conditional histograms

Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

# Create hist of horsepwr
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower for all cars")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create hist of horsepwr for affordable cars
cars %>%
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for affordable cars")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

2.4 Three binwidths

Before you take these plots for granted, it’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes.

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("Distribution of horsepower for affordable cars")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Distribution of horsepower for affordable cars")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("Distribution of horsepower for affordable cars")

2.5 Box plots for outliers

n addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
    geom_boxplot()

2.6 Plot selection

Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Try plotting a histogram of the two variables to see their distributions. Which has outliers and which doesn't? The variable with outliers should be displayed as a box plot and the variable without outliers should be displayed as a density plot.
# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Create plot of width
cars %>% 
  ggplot(aes(x = width)) +
  geom_density()

## Warning: Removed 28 rows containing non-finite values (stat_density).

2.7 3Three variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("Hists using hwy mileage and ncyl")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

3. Numerical Summaries%

Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.

3.1 Calculate center measures

library(gapminder)
head(gapminder)

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year==2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean = mean(lifeExp),
            median = median(lifeExp))

## # A tibble: 5 × 3
##   continent  mean median
##   <fct>     <dbl>  <dbl>
## 1 Africa     54.8   52.9
## 2 Americas   73.6   72.9
## 3 Asia       70.7   72.4
## 4 Europe     77.6   78.6
## 5 Oceania    80.7   80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

3.2 Calculate spread measures

Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 × 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

3.3 Choose measures for center and spread

Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 × 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 × 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

3.4 Transformations

Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.

Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

3.5 Identify outliers

Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>% 
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

4. Case Study

Apply what you’ve learned to explore and summarize a real world dataset in this case study of email spam.

4.1 Data cleaning and summarizing with dplyr

The best way to learn data wrangling skills is to apply them to a specific case study. Here you’ll learn how to clean and filter the United Nations voting dataset using the dplyr package, and how to summarize it into smaller, interpretable units.

library(countrycode)
# Convert country code 100
countrycode(100, "cown", "country.name")

## [1] "Colombia"

# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))

## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 260

# Find total and fraction of "yes" votes
votes_processed %>% summarize(total = n(),
percent_yes = mean(vote==1))

## # A tibble: 1 × 2
##    total percent_yes
##    <int>       <dbl>
## 1 353547       0.800

# Change this code to summarize by year
votes_processed %>%
group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## # A tibble: 34 × 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1947  2039       0.569
##  2  1949  3469       0.438
##  3  1951  1434       0.585
##  4  1953  1537       0.632
##  5  1955  2169       0.695
##  6  1957  2708       0.609
##  7  1959  4326       0.588
##  8  1961  7482       0.573
##  9  1963  3308       0.729
## 10  1965  4382       0.708
## # … with 24 more rows

# Summarize by country: by_country
by_country<- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
# Now sort in descending order
by_country  %>% arrange(desc(percent_yes))

## # A tibble: 200 × 3
##    country              total percent_yes
##    <chr>                <int>       <dbl>
##  1 São Tomé & Príncipe   1091       0.976
##  2 Seychelles             881       0.975
##  3 Djibouti              1598       0.961
##  4 Guinea-Bissau         1538       0.960
##  5 Timor-Leste            326       0.957
##  6 Mauritius             1831       0.950
##  7 Zimbabwe              1361       0.949
##  8 Comoros               1133       0.947
##  9 United Arab Emirates  1934       0.947
## 10 Mozambique            1701       0.947
## # … with 190 more rows

# Filter out countries with fewer than 100 votes
by_country %>%
  arrange(percent_yes) %>%filter(total > 100)

## # A tibble: 197 × 3
##    country                          total percent_yes
##    <chr>                            <int>       <dbl>
##  1 United States                     2568       0.269
##  2 Palau                              369       0.339
##  3 Israel                            2380       0.341
##  4 <NA>                              1075       0.397
##  5 United Kingdom                    2558       0.417
##  6 France                            2527       0.427
##  7 Micronesia (Federated States of)   724       0.442
##  8 Marshall Islands                   757       0.491
##  9 Belgium                           2568       0.492
## 10 Canada                            2576       0.508
## # … with 187 more rows

4.2 Data visualization with ggplot2

Once you’ve cleaned and summarized data, you’ll want to visualize them to understand trends and extract insights. Here you’ll use the ggplot2 package to explore trends in United Nations voting within each country over time.

by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# Create line plot
ggplot(by_year, aes(year, percent_yes)) + geom_line()

# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Group by year and country: by_year_country
by_year_country <- votes_processed %>%
  group_by(year, country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

# Create a filtered version: UK_by_year
UK_by_year <- by_year_country %>% filter(country == "United Kingdom")
# Line plot of percent_yes over time for UK only
ggplot(UK_by_year, aes(year, percent_yes)) + 
  geom_line()

# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
               "France", "India")

# Filter by_year_country: f iltered_4_countries
filtered_4_countries <- by_year_country %>% filter(country %in% countries)
# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(year, percent_yes, color = country)) +
  geom_line()

# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
               "France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>% filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scale = "free_y")

4.3 Tidy modeling with broom

While visualization helps you understand one country at a time, statistical modeling lets you quantify trends across many countries and interpret them together. Here you’ll learn to use the tidyr, purrr, and broom packages to fit linear models to each country, and understand and compare their outputs.

# Percentage of yes votes from the US by year: US_by_year
US_by_year <- by_year_country %>%
  filter(country == "United States")
# Print the US_by_year data
US_by_year

## # A tibble: 34 × 4
## # Groups:   year [34]
##     year country       total percent_yes
##    <dbl> <chr>         <int>       <dbl>
##  1  1947 United States    38       0.711
##  2  1949 United States    64       0.281
##  3  1951 United States    25       0.4  
##  4  1953 United States    26       0.5  
##  5  1955 United States    37       0.622
##  6  1957 United States    34       0.647
##  7  1959 United States    54       0.426
##  8  1961 United States    75       0.507
##  9  1963 United States    32       0.5  
## 10  1965 United States    41       0.366
## # … with 24 more rows

# Perform a linear regression of percent_yes by year: US_fit
US_fit <- lm(percent_yes ~ year, US_by_year)
# Perform summary() on the US_fit object
summary(US_fit)

## 
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222491 -0.080635 -0.008661  0.081948  0.194307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.6641455  1.8379743   6.890 8.48e-08 ***
## year        -0.0062393  0.0009282  -6.722 1.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared:  0.5854, Adjusted R-squared:  0.5724 
## F-statistic: 45.18 on 1 and 32 DF,  p-value: 1.367e-07

# Call the tidy() function on the US_fit object
tidy(US_fit)

## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) 12.7      1.84          6.89 0.0000000848
## 2 year        -0.00624  0.000928     -6.72 0.000000137

# Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
  filter(country == "United Kingdom")
UK_fit <- lm(percent_yes ~ year, UK_by_year)
# Create US_tidied and UK_tidied
US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)
# Combine the two tidied models
bind_rows(US_tidied, UK_tidied)

## # A tibble: 4 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) 12.7      1.84          6.89 0.0000000848
## 2 year        -0.00624  0.000928     -6.72 0.000000137 
## 3 (Intercept) -3.27     1.96         -1.67 0.105       
## 4 year         0.00187  0.000989      1.89 0.0677

4.4 Nesting a data frame

Right now, the by_year_country data frame has one row per country-vote pair. So that you can model each country individually, you’re going to “nest” all columns besides country, which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.

# Nest all columns besides country
by_year_country %>%
nest(-country)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 × 2
##    country     data                 
##    <chr>       <list>               
##  1 Afghanistan <grouped_df [34 × 3]>
##  2 Argentina   <grouped_df [34 × 3]>
##  3 Australia   <grouped_df [34 × 3]>
##  4 Belarus     <grouped_df [34 × 3]>
##  5 Belgium     <grouped_df [34 × 3]>
##  6 Bolivia     <grouped_df [34 × 3]>
##  7 Brazil      <grouped_df [34 × 3]>
##  8 Canada      <grouped_df [34 × 3]>
##  9 Chile       <grouped_df [34 × 3]>
## 10 Colombia    <grouped_df [34 × 3]>
## # … with 190 more rows

# All countries are nested besides country
nested <- by_year_country %>%
  nest(-country)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

# Unnest the data column to return it to its original form
unnest(nested)

## Warning: `cols` is now required when using unnest().
## Please use `cols = c(data)`

## # A tibble: 4,744 × 4
##    country      year total percent_yes
##    <chr>       <dbl> <int>       <dbl>
##  1 Afghanistan  1947    34       0.382
##  2 Afghanistan  1949    51       0.608
##  3 Afghanistan  1951    25       0.76 
##  4 Afghanistan  1953    26       0.769
##  5 Afghanistan  1955    37       0.730
##  6 Afghanistan  1957    34       0.529
##  7 Afghanistan  1959    54       0.611
##  8 Afghanistan  1961    76       0.605
##  9 Afghanistan  1963    32       0.781
## 10 Afghanistan  1965    40       0.85 
## # … with 4,734 more rows

# Perform a linear regression on each item in the data column
by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 × 3
##    country     data                  model 
##    <chr>       <list>                <list>
##  1 Afghanistan <grouped_df [34 × 3]> <lm>  
##  2 Argentina   <grouped_df [34 × 3]> <lm>  
##  3 Australia   <grouped_df [34 × 3]> <lm>  
##  4 Belarus     <grouped_df [34 × 3]> <lm>  
##  5 Belgium     <grouped_df [34 × 3]> <lm>  
##  6 Bolivia     <grouped_df [34 × 3]> <lm>  
##  7 Brazil      <grouped_df [34 × 3]> <lm>  
##  8 Canada      <grouped_df [34 × 3]> <lm>  
##  9 Chile       <grouped_df [34 × 3]> <lm>  
## 10 Colombia    <grouped_df [34 × 3]> <lm>  
## # … with 190 more rows

# Add another mutate that applies tidy() to each model
by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))) %>%
  mutate(tidied=map(model, tidy))

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## # A tibble: 200 × 4
##    country     data                  model  tidied          
##    <chr>       <list>                <list> <list>          
##  1 Afghanistan <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  2 Argentina   <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  3 Australia   <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  4 Belarus     <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  5 Belgium     <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  6 Bolivia     <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  7 Brazil      <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  8 Canada      <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
##  9 Chile       <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
## 10 Colombia    <grouped_df [34 × 3]> <lm>   <tibble [2 × 5]>
## # … with 190 more rows

# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
         unnest(tidied)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

# Print the resulting country_coefficients variable
country_coefficients

## # A tibble: 400 × 8
##    country     data        model  term    estimate std.error statistic   p.value
##    <chr>       <list>      <list> <chr>      <dbl>     <dbl>     <dbl>     <dbl>
##  1 Afghanistan <grouped_d… <lm>   (Inter… -1.11e+1  1.47         -7.52   1.44e-8
##  2 Afghanistan <grouped_d… <lm>   year     6.01e-3  0.000743      8.09   3.06e-9
##  3 Argentina   <grouped_d… <lm>   (Inter… -9.46e+0  2.10         -4.50   8.32e-5
##  4 Argentina   <grouped_d… <lm>   year     5.15e-3  0.00106       4.85   3.05e-5
##  5 Australia   <grouped_d… <lm>   (Inter… -4.55e+0  2.15         -2.12   4.22e-2
##  6 Australia   <grouped_d… <lm>   year     2.57e-3  0.00108       2.37   2.42e-2
##  7 Belarus     <grouped_d… <lm>   (Inter… -7.00e+0  1.50         -4.66   5.33e-5
##  8 Belarus     <grouped_d… <lm>   year     3.91e-3  0.000759      5.15   1.28e-5
##  9 Belgium     <grouped_d… <lm>   (Inter… -5.85e+0  1.52         -3.86   5.22e-4
## 10 Belgium     <grouped_d… <lm>   year     3.20e-3  0.000765      4.19   2.07e-4
## # … with 390 more rows

4.5 Filtering for significant countries

Not all slopes are significant, and you can use the p-value to guess which are and which are not. However, when you have lots of p-values, like one for each country, you run into the problem of multiple hypothesis testing, where you have to set a stricter threshold. The p.adjust() function is a simple way to correct for this, where p.adjust(p.value) on a vector of p-values returns a set that you can trust.

# Filter for only the slope terms
country_coefficients %>% 
filter(term == "year")

## # A tibble: 200 × 8
##    country     data          model  term  estimate std.error statistic   p.value
##    <chr>       <list>        <list> <chr>    <dbl>     <dbl>     <dbl>     <dbl>
##  1 Afghanistan <grouped_df … <lm>   year   0.00601  0.000743      8.09   3.06e-9
##  2 Argentina   <grouped_df … <lm>   year   0.00515  0.00106       4.85   3.05e-5
##  3 Australia   <grouped_df … <lm>   year   0.00257  0.00108       2.37   2.42e-2
##  4 Belarus     <grouped_df … <lm>   year   0.00391  0.000759      5.15   1.28e-5
##  5 Belgium     <grouped_df … <lm>   year   0.00320  0.000765      4.19   2.07e-4
##  6 Bolivia     <grouped_df … <lm>   year   0.00580  0.000966      6.01   1.06e-6
##  7 Brazil      <grouped_df … <lm>   year   0.00611  0.000817      7.48   1.64e-8
##  8 Canada      <grouped_df … <lm>   year   0.00152  0.000955      1.59   1.22e-1
##  9 Chile       <grouped_df … <lm>   year   0.00678  0.000822      8.24   2.05e-9
## 10 Colombia    <grouped_df … <lm>   year   0.00616  0.000965      6.38   3.58e-7
## # … with 190 more rows

# Filter for only the slope terms
slope_terms <- country_coefficients %>%
  filter(term == "year") 

# Add p.adjusted column, then filter
slope_terms %>%  mutate(p.adjusted = p.adjust(p.value)) %>% 
filter(p.adjusted < 0.05)

## # A tibble: 61 × 9
##    country     data  model term  estimate std.error statistic p.value p.adjusted
##    <chr>       <lis> <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
##  1 Afghanistan <gro… <lm>  year   0.00601  0.000743      8.09 3.06e-9    5.95e-7
##  2 Argentina   <gro… <lm>  year   0.00515  0.00106       4.85 3.05e-5    4.81e-3
##  3 Belarus     <gro… <lm>  year   0.00391  0.000759      5.15 1.28e-5    2.08e-3
##  4 Belgium     <gro… <lm>  year   0.00320  0.000765      4.19 2.07e-4    3.01e-2
##  5 Bolivia     <gro… <lm>  year   0.00580  0.000966      6.01 1.06e-6    1.88e-4
##  6 Brazil      <gro… <lm>  year   0.00611  0.000817      7.48 1.64e-8    3.12e-6
##  7 Chile       <gro… <lm>  year   0.00678  0.000822      8.24 2.05e-9    3.99e-7
##  8 Colombia    <gro… <lm>  year   0.00616  0.000965      6.38 3.58e-7    6.56e-5
##  9 Costa Rica  <gro… <lm>  year   0.00654  0.000812      8.05 3.39e-9    6.54e-7
## 10 Cuba        <gro… <lm>  year   0.00461  0.000721      6.40 3.43e-7    6.31e-5
## # … with 51 more rows

# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05) 
# Sort for the countries increasing most quickly
filtered_countries %>% arrange(estimate )

## # A tibble: 61 × 9
##    country   data    model term  estimate std.error statistic p.value p.adjusted
##    <chr>     <list>  <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
##  1 South Ko… <group… <lm>  year  -0.00921  0.00155      -5.96 1.39e-4  0.0209   
##  2 Israel    <group… <lm>  year  -0.00685  0.00117      -5.85 1.89e-6  0.000331 
##  3 United S… <group… <lm>  year  -0.00624  0.000928     -6.72 1.37e-7  0.0000254
##  4 Belgium   <group… <lm>  year   0.00320  0.000765      4.19 2.07e-4  0.0301   
##  5 Guinea    <group… <lm>  year   0.00362  0.000833      4.35 1.87e-4  0.0275   
##  6 Morocco   <group… <lm>  year   0.00380  0.000860      4.42 1.46e-4  0.0218   
##  7 Belarus   <group… <lm>  year   0.00391  0.000759      5.15 1.28e-5  0.00208  
##  8 Iran      <group… <lm>  year   0.00391  0.000856      4.57 6.91e-5  0.0107   
##  9 Congo - … <group… <lm>  year   0.00397  0.000922      4.30 2.27e-4  0.0326   
## 10 Sudan     <group… <lm>  year   0.00399  0.000961      4.15 2.98e-4  0.0420   
## # … with 51 more rows

# Sort for the countries decreasing most quickly
filtered_countries %>% arrange(desc(estimate ))

## # A tibble: 61 × 9
##    country   data   model term  estimate std.error statistic  p.value p.adjusted
##    <chr>     <list> <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
##  1 South Af… <grou… <lm>  year   0.0119   0.00140       8.47 1.60e- 8    3.05e-6
##  2 Kazakhst… <grou… <lm>  year   0.0110   0.00195       5.62 3.24e- 4    4.51e-2
##  3 Yemen Ar… <grou… <lm>  year   0.0109   0.00159       6.84 1.20e- 6    2.11e-4
##  4 Kyrgyzst… <grou… <lm>  year   0.00973  0.000988      9.84 2.38e- 5    3.78e-3
##  5 Malawi    <grou… <lm>  year   0.00908  0.00181       5.02 4.48e- 5    7.03e-3
##  6 Dominica… <grou… <lm>  year   0.00806  0.000914      8.81 5.96e-10    1.17e-7
##  7 Portugal  <grou… <lm>  year   0.00802  0.00171       4.68 7.13e- 5    1.10e-2
##  8 Honduras  <grou… <lm>  year   0.00772  0.000921      8.38 1.43e- 9    2.81e-7
##  9 Peru      <grou… <lm>  year   0.00730  0.000976      7.48 1.65e- 8    3.12e-6
## 10 Nicaragua <grou… <lm>  year   0.00708  0.00107       6.60 1.92e- 7    3.55e-5
## # … with 51 more rows

4.6 Joining and tidying

In this chapter, you’ll learn to combine multiple related datasets, such as incorporating information about each resolution’s topic into your vote analysis. You’ll also learn how to turn untidy data into tidy data, and see how tidy data can guide your exploration of topics and countries over time.

# Print the votes_processed dataset
votes_processed

## # A tibble: 353,547 × 6
##     rcid session  vote ccode  year country           
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
##  1    46       2     1     2  1947 United States     
##  2    46       2     1    20  1947 Canada            
##  3    46       2     1    40  1947 Cuba              
##  4    46       2     1    41  1947 Haiti             
##  5    46       2     1    42  1947 Dominican Republic
##  6    46       2     1    70  1947 Mexico            
##  7    46       2     1    90  1947 Guatemala         
##  8    46       2     1    91  1947 Honduras          
##  9    46       2     1    92  1947 El Salvador       
## 10    46       2     1    93  1947 Nicaragua         
## # … with 353,537 more rows

# Print the descriptions dataset
descriptions

## # A tibble: 2,589 × 10
##     rcid session date                unres      me    nu    di    hr    co    ec
##    <dbl>   <dbl> <dttm>              <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    46       2 1947-09-04 00:00:00 R/2/299     0     0     0     0     0     0
##  2    47       2 1947-10-05 00:00:00 R/2/355     0     0     0     1     0     0
##  3    48       2 1947-10-06 00:00:00 R/2/461     0     0     0     0     0     0
##  4    49       2 1947-10-06 00:00:00 R/2/463     0     0     0     0     0     0
##  5    50       2 1947-10-06 00:00:00 R/2/465     0     0     0     0     0     0
##  6    51       2 1947-10-02 00:00:00 R/2/561     0     0     0     0     1     0
##  7    52       2 1947-11-06 00:00:00 R/2/650     0     0     0     0     1     0
##  8    53       2 1947-11-06 00:00:00 R/2/651     0     0     0     0     1     0
##  9    54       2 1947-11-06 00:00:00 R/2/651     0     0     0     0     1     0
## 10    55       2 1947-11-06 00:00:00 R/2/667     0     0     0     0     1     0
## # … with 2,579 more rows

# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>% inner_join(descriptions, by = c("rcid", "session"))

# Filter for votes related to colonialism
votes_joined %>% filter(co == 1)

## # A tibble: 60,962 × 14
##     rcid session  vote ccode  year country date                unres    me    nu
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>   <dttm>              <chr> <dbl> <dbl>
##  1    51       2     3     2  1947 United… 1947-10-02 00:00:00 R/2/…     0     0
##  2    51       2     3    20  1947 Canada  1947-10-02 00:00:00 R/2/…     0     0
##  3    51       2     2    40  1947 Cuba    1947-10-02 00:00:00 R/2/…     0     0
##  4    51       2     1    41  1947 Haiti   1947-10-02 00:00:00 R/2/…     0     0
##  5    51       2     3    42  1947 Domini… 1947-10-02 00:00:00 R/2/…     0     0
##  6    51       2     2    70  1947 Mexico  1947-10-02 00:00:00 R/2/…     0     0
##  7    51       2     2    90  1947 Guatem… 1947-10-02 00:00:00 R/2/…     0     0
##  8    51       2     2    92  1947 El Sal… 1947-10-02 00:00:00 R/2/…     0     0
##  9    51       2     3    93  1947 Nicara… 1947-10-02 00:00:00 R/2/…     0     0
## 10    51       2     2    95  1947 Panama  1947-10-02 00:00:00 R/2/…     0     0
## # … with 60,952 more rows, and 4 more variables: di <dbl>, hr <dbl>, co <dbl>,
## #   ec <dbl>

# Filter, then summarize by year: US_co_by_year
US_co_by_year <- votes_joined %>%
  filter(country == "United States", co == 1) %>%
  group_by(year) %>%
  summarize(percent_yes = mean(vote == 1))

# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(year, percent_yes)) +
  geom_line()

4.6 Using gather to tidy a dataset

In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic. This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).

# Gather the six me/nu/di/hr/co/ec columns
votes_joined %>%
  gather(topic, has_topic, me:ec)

## # A tibble: 2,121,282 × 10
##     rcid session  vote ccode  year country       date                unres topic
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>         <dttm>              <chr> <chr>
##  1    46       2     1     2  1947 United States 1947-09-04 00:00:00 R/2/… me   
##  2    46       2     1    20  1947 Canada        1947-09-04 00:00:00 R/2/… me   
##  3    46       2     1    40  1947 Cuba          1947-09-04 00:00:00 R/2/… me   
##  4    46       2     1    41  1947 Haiti         1947-09-04 00:00:00 R/2/… me   
##  5    46       2     1    42  1947 Dominican Re… 1947-09-04 00:00:00 R/2/… me   
##  6    46       2     1    70  1947 Mexico        1947-09-04 00:00:00 R/2/… me   
##  7    46       2     1    90  1947 Guatemala     1947-09-04 00:00:00 R/2/… me   
##  8    46       2     1    91  1947 Honduras      1947-09-04 00:00:00 R/2/… me   
##  9    46       2     1    92  1947 El Salvador   1947-09-04 00:00:00 R/2/… me   
## 10    46       2     1    93  1947 Nicaragua     1947-09-04 00:00:00 R/2/… me   
## # … with 2,121,272 more rows, and 1 more variable: has_topic <dbl>

# Perform gather again, then filter
votes_gathered <- votes_joined %>%
  gather(topic, has_topic, me:ec) %>%
  filter(has_topic == 1)

# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
  mutate(topic = recode(topic,
                        me = "Palestinian conflict",
                        nu = "Nuclear weapons and nuclear material",
                        di = "Arms control and disarmament",
                        hr = "Human rights",
                        co = "Colonialism",
                        ec = "Economic development"))

# Print votes_tidied
votes_tidied

## # A tibble: 350,032 × 10
##     rcid session  vote ccode  year country    date                unres topic   
##    <dbl>   <dbl> <dbl> <int> <dbl> <chr>      <dttm>              <chr> <chr>   
##  1    77       2     1     2  1947 United St… 1947-11-06 00:00:00 R/2/… Palesti…
##  2    77       2     1    20  1947 Canada     1947-11-06 00:00:00 R/2/… Palesti…
##  3    77       2     3    40  1947 Cuba       1947-11-06 00:00:00 R/2/… Palesti…
##  4    77       2     1    41  1947 Haiti      1947-11-06 00:00:00 R/2/… Palesti…
##  5    77       2     1    42  1947 Dominican… 1947-11-06 00:00:00 R/2/… Palesti…
##  6    77       2     2    70  1947 Mexico     1947-11-06 00:00:00 R/2/… Palesti…
##  7    77       2     1    90  1947 Guatemala  1947-11-06 00:00:00 R/2/… Palesti…
##  8    77       2     2    91  1947 Honduras   1947-11-06 00:00:00 R/2/… Palesti…
##  9    77       2     2    92  1947 El Salvad… 1947-11-06 00:00:00 R/2/… Palesti…
## 10    77       2     1    93  1947 Nicaragua  1947-11-06 00:00:00 R/2/… Palesti…
## # … with 350,022 more rows, and 1 more variable: has_topic <dbl>

# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>% group_by(country, year, topic) %>% 
summarize(total = n(), percent_yes = mean(vote==1)) %>%
ungroup()

## `summarise()` has grouped output by 'country', 'year'. You can override using
## the `.groups` argument.

# Print by_country_year_topic
by_country_year_topic

## # A tibble: 26,968 × 5
##    country      year topic                                total percent_yes
##    <chr>       <dbl> <chr>                                <int>       <dbl>
##  1 Afghanistan  1947 Colonialism                              8       0.5  
##  2 Afghanistan  1947 Economic development                     1       0    
##  3 Afghanistan  1947 Human rights                             1       0    
##  4 Afghanistan  1947 Palestinian conflict                     6       0    
##  5 Afghanistan  1949 Arms control and disarmament             3       0    
##  6 Afghanistan  1949 Colonialism                             22       0.864
##  7 Afghanistan  1949 Economic development                     1       1    
##  8 Afghanistan  1949 Human rights                             3       0    
##  9 Afghanistan  1949 Nuclear weapons and nuclear material     3       0    
## 10 Afghanistan  1949 Palestinian conflict                    11       0.818
## # … with 26,958 more rows

# Filter by_country_year_topic for just the US
US_by_country_year_topic <- by_country_year_topic %>% filter(country == "United States")

# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(year, percent_yes)) + 
geom_line() + facet_wrap(~ topic)

# Print by_country_year_topic
by_country_year_topic

## # A tibble: 26,968 × 5
##    country      year topic                                total percent_yes
##    <chr>       <dbl> <chr>                                <int>       <dbl>
##  1 Afghanistan  1947 Colonialism                              8       0.5  
##  2 Afghanistan  1947 Economic development                     1       0    
##  3 Afghanistan  1947 Human rights                             1       0    
##  4 Afghanistan  1947 Palestinian conflict                     6       0    
##  5 Afghanistan  1949 Arms control and disarmament             3       0    
##  6 Afghanistan  1949 Colonialism                             22       0.864
##  7 Afghanistan  1949 Economic development                     1       1    
##  8 Afghanistan  1949 Human rights                             3       0    
##  9 Afghanistan  1949 Nuclear weapons and nuclear material     3       0    
## 10 Afghanistan  1949 Palestinian conflict                    11       0.818
## # … with 26,958 more rows

# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
  nest(-country, -topic) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

# Print country_topic_coefficients
country_topic_coefficients

## # A tibble: 2,384 × 9
##    country     topic    data    model term  estimate std.error statistic p.value
##    <chr>       <chr>    <list>  <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
##  1 Afghanistan Colonia… <tibbl… <lm>  (Int… -9.20e+0  1.96         -4.70 4.76e-5
##  2 Afghanistan Colonia… <tibbl… <lm>  year   5.11e-3  0.000989      5.17 1.23e-5
##  3 Afghanistan Economi… <tibbl… <lm>  (Int… -1.15e+1  3.62         -3.17 3.49e-3
##  4 Afghanistan Economi… <tibbl… <lm>  year   6.24e-3  0.00183       3.42 1.85e-3
##  5 Afghanistan Human r… <tibbl… <lm>  (Int… -7.27e+0  4.37         -1.66 1.06e-1
##  6 Afghanistan Human r… <tibbl… <lm>  year   4.08e-3  0.00221       1.85 7.43e-2
##  7 Afghanistan Palesti… <tibbl… <lm>  (Int… -1.33e+1  3.57         -3.73 8.66e-4
##  8 Afghanistan Palesti… <tibbl… <lm>  year   7.17e-3  0.00180       3.98 4.42e-4
##  9 Afghanistan Arms co… <tibbl… <lm>  (Int… -1.38e+1  4.13         -3.33 2.53e-3
## 10 Afghanistan Arms co… <tibbl… <lm>  year   7.37e-3  0.00208       3.54 1.49e-3
## # … with 2,374 more rows

# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>% filter(term == "year") %>% 
mutate(p.adjusted = p.adjust(p.value)) %>% filter(p.adjusted < 0.05)

4.7 Checking models visually

In the last exercise, you found that over its history, Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.

# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>% filter(country == "Vanuatu")

# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)

The End.

Thanks DataCamp

- My Favorite Team -