Libraries
library(tidyverse)
library(openintro)
library(assertive)
library(fst)
library(broom)
library(ggfortify)
library(yardstick)
library(moderndive)
library(plot3D)
library(magrittr)
library(infer)
library(anytime)
library(lubridate)
library(ggridges)
Data
cars<- read.csv("cars.csv")
comics <- read.csv("comics.csv")
immigrant <- read.csv("immigrant.csv")
life <- read.csv("life.csv")
names <- read.csv("names.csv")
income <- read.csv("income.csv")
votes <- read_rds("votes.rds")
descriptions<- read_rds("descriptions.rds")
When your dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.
In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.
In this chapter you’ll continue working with the comics dataset introduced in the video. This is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.
Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.
# Check levels of align
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "1997, December" "Female" "Male" "Other"
which(grepl("1997, December", comics$gender))
## [1] 20261
# Create a 2-way contingency table
table(comics$align, comics$gender)
##
## 1997, December Female Male Other
## Bad 0 1550 7497 32
## Good 0 2472 4786 17
## Neutral 0 831 1794 17
## Reformed Criminals 0 1 2 0
tab <- table(comics$align, comics$gender)
tab <- tab[ ,2:4]
tab
##
## Female Male Other
## Bad 1550 7497 32
## Good 2472 4786 17
## Neutral 831 1794 17
## Reformed Criminals 1 2 0
The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.
In R, this requires two steps: first filtering out any rows with the levels that have very low counts, then removing these levels from the factor variable with droplevels(). This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.
# Remove align level
comics_filtered <- comics %>%
filter(align != "Reformed Criminals") %>%
droplevels()
# See the result
head(comics_filtered)
## name id align eye hair
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes Brown Hair
## 2 Captain America (Steven Rogers) Public Good Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes Black Hair
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes Black Hair
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes Blond Hair
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes No Hair
## gender gsm alive appearances first_appear publisher
## 1 Male <NA> Living Characters 4043 Aug-62 marvel
## 2 Male <NA> Living Characters 3360 Mar-41 marvel
## 3 Male <NA> Living Characters 3061 Oct-74 marvel
## 4 Male <NA> Living Characters 2961 Mar-63 marvel
## 5 Male <NA> Living Characters 2258 Nov-50 marvel
## 6 Male <NA> Living Characters 2255 Nov-61 marvel
While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.This shows that there can often be two or more options for presenting the same data. Passing the argument position = “dodge” to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge")
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
The following code generates tables of joint and conditional proportions, respectively:
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill") +
ylab("proportion")
If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable. You can improve the interpretability of the plot, though, by implementing some sensible ordering.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
In this chapter, you will learn how to graphically summarize numerical data.
In this chapter, you’ll be working with the cars dataset, which records characteristics on all of the new models of cars for sale in the US in a certain year. You will investigate the distribution of mileage across a categorical variable, but before you get there, you’ll want to familiarize yourself with the dataset.
# Learn data structure
str(cars)
## 'data.frame': 428 obs. of 20 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).
The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you’ll try your hand at two alternatives: the box plot and the density plot.
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).
Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.
# Create hist of horsepwr
cars %>%
ggplot(aes(x = horsepwr)) +
geom_histogram() +
ggtitle("Distribution of horsepower for all cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(x = horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Distribution of horsepower for affordable cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Before you take these plots for granted, it’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("Distribution of horsepower for affordable cars")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("Distribution of horsepower for affordable cars")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("Distribution of horsepower for affordable cars")
n addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.
# Construct box plot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
# Try plotting a histogram of the two variables to see their distributions. Which has outliers and which doesn't? The variable with outliers should be displayed as a box plot and the variable without outliers should be displayed as a density plot.
# Create plot of city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
# Create plot of width
cars %>%
ggplot(aes(x = width)) +
geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).
Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("Hists using hwy mileage and ncyl")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.
library(gapminder)
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year==2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean = mean(lifeExp),
median = median(lifeExp))
## # A tibble: 5 × 3
## continent mean median
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 × 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 × 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 × 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.
Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(!is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
Apply what you’ve learned to explore and summarize a real world dataset in this case study of email spam.
The best way to learn data wrangling skills is to apply them to a specific case study. Here you’ll learn how to clean and filter the United Nations voting dataset using the dplyr package, and how to summarize it into smaller, interpretable units.
library(countrycode)
# Convert country code 100
countrycode(100, "cown", "country.name")
## [1] "Colombia"
# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
filter(vote <= 3) %>%
mutate(year = session + 1945,
country = countrycode(ccode, "cown", "country.name"))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 260
# Find total and fraction of "yes" votes
votes_processed %>% summarize(total = n(),
percent_yes = mean(vote==1))
## # A tibble: 1 × 2
## total percent_yes
## <int> <dbl>
## 1 353547 0.800
# Change this code to summarize by year
votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## # A tibble: 34 × 3
## year total percent_yes
## <dbl> <int> <dbl>
## 1 1947 2039 0.569
## 2 1949 3469 0.438
## 3 1951 1434 0.585
## 4 1953 1537 0.632
## 5 1955 2169 0.695
## 6 1957 2708 0.609
## 7 1959 4326 0.588
## 8 1961 7482 0.573
## 9 1963 3308 0.729
## 10 1965 4382 0.708
## # … with 24 more rows
# Summarize by country: by_country
by_country<- votes_processed %>%
group_by(country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
# Now sort in descending order
by_country %>% arrange(desc(percent_yes))
## # A tibble: 200 × 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 São Tomé & Príncipe 1091 0.976
## 2 Seychelles 881 0.975
## 3 Djibouti 1598 0.961
## 4 Guinea-Bissau 1538 0.960
## 5 Timor-Leste 326 0.957
## 6 Mauritius 1831 0.950
## 7 Zimbabwe 1361 0.949
## 8 Comoros 1133 0.947
## 9 United Arab Emirates 1934 0.947
## 10 Mozambique 1701 0.947
## # … with 190 more rows
# Filter out countries with fewer than 100 votes
by_country %>%
arrange(percent_yes) %>%filter(total > 100)
## # A tibble: 197 × 3
## country total percent_yes
## <chr> <int> <dbl>
## 1 United States 2568 0.269
## 2 Palau 369 0.339
## 3 Israel 2380 0.341
## 4 <NA> 1075 0.397
## 5 United Kingdom 2558 0.417
## 6 France 2527 0.427
## 7 Micronesia (Federated States of) 724 0.442
## 8 Marshall Islands 757 0.491
## 9 Belgium 2568 0.492
## 10 Canada 2576 0.508
## # … with 187 more rows
Once you’ve cleaned and summarized data, you’ll want to visualize them to understand trends and extract insights. Here you’ll use the ggplot2 package to explore trends in United Nations voting within each country over time.
by_year <- votes_processed %>%
group_by(year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
# Create line plot
ggplot(by_year, aes(year, percent_yes)) + geom_line()
# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Group by year and country: by_year_country
by_year_country <- votes_processed %>%
group_by(year, country) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# Create a filtered version: UK_by_year
UK_by_year <- by_year_country %>% filter(country == "United Kingdom")
# Line plot of percent_yes over time for UK only
ggplot(UK_by_year, aes(year, percent_yes)) +
geom_line()
# Vector of four countries to examine
countries <- c("United States", "United Kingdom",
"France", "India")
# Filter by_year_country: f iltered_4_countries
filtered_4_countries <- by_year_country %>% filter(country %in% countries)
# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(year, percent_yes, color = country)) +
geom_line()
# Vector of six countries to examine
countries <- c("United States", "United Kingdom",
"France", "Japan", "Brazil", "India")
# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>% filter(country %in% countries)
# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ country, scale = "free_y")
While visualization helps you understand one country at a time, statistical modeling lets you quantify trends across many countries and interpret them together. Here you’ll learn to use the tidyr, purrr, and broom packages to fit linear models to each country, and understand and compare their outputs.
# Percentage of yes votes from the US by year: US_by_year
US_by_year <- by_year_country %>%
filter(country == "United States")
# Print the US_by_year data
US_by_year
## # A tibble: 34 × 4
## # Groups: year [34]
## year country total percent_yes
## <dbl> <chr> <int> <dbl>
## 1 1947 United States 38 0.711
## 2 1949 United States 64 0.281
## 3 1951 United States 25 0.4
## 4 1953 United States 26 0.5
## 5 1955 United States 37 0.622
## 6 1957 United States 34 0.647
## 7 1959 United States 54 0.426
## 8 1961 United States 75 0.507
## 9 1963 United States 32 0.5
## 10 1965 United States 41 0.366
## # … with 24 more rows
# Perform a linear regression of percent_yes by year: US_fit
US_fit <- lm(percent_yes ~ year, US_by_year)
# Perform summary() on the US_fit object
summary(US_fit)
##
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.222491 -0.080635 -0.008661 0.081948 0.194307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.6641455 1.8379743 6.890 8.48e-08 ***
## year -0.0062393 0.0009282 -6.722 1.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared: 0.5854, Adjusted R-squared: 0.5724
## F-statistic: 45.18 on 1 and 32 DF, p-value: 1.367e-07
# Call the tidy() function on the US_fit object
tidy(US_fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.7 1.84 6.89 0.0000000848
## 2 year -0.00624 0.000928 -6.72 0.000000137
# Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
filter(country == "United Kingdom")
UK_fit <- lm(percent_yes ~ year, UK_by_year)
# Create US_tidied and UK_tidied
US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)
# Combine the two tidied models
bind_rows(US_tidied, UK_tidied)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.7 1.84 6.89 0.0000000848
## 2 year -0.00624 0.000928 -6.72 0.000000137
## 3 (Intercept) -3.27 1.96 -1.67 0.105
## 4 year 0.00187 0.000989 1.89 0.0677
Right now, the by_year_country data frame has one row per country-vote pair. So that you can model each country individually, you’re going to “nest” all columns besides country, which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.
# Nest all columns besides country
by_year_country %>%
nest(-country)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 × 2
## country data
## <chr> <list>
## 1 Afghanistan <grouped_df [34 × 3]>
## 2 Argentina <grouped_df [34 × 3]>
## 3 Australia <grouped_df [34 × 3]>
## 4 Belarus <grouped_df [34 × 3]>
## 5 Belgium <grouped_df [34 × 3]>
## 6 Bolivia <grouped_df [34 × 3]>
## 7 Brazil <grouped_df [34 × 3]>
## 8 Canada <grouped_df [34 × 3]>
## 9 Chile <grouped_df [34 × 3]>
## 10 Colombia <grouped_df [34 × 3]>
## # … with 190 more rows
# All countries are nested besides country
nested <- by_year_country %>%
nest(-country)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
# Unnest the data column to return it to its original form
unnest(nested)
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(data)`
## # A tibble: 4,744 × 4
## country year total percent_yes
## <chr> <dbl> <int> <dbl>
## 1 Afghanistan 1947 34 0.382
## 2 Afghanistan 1949 51 0.608
## 3 Afghanistan 1951 25 0.76
## 4 Afghanistan 1953 26 0.769
## 5 Afghanistan 1955 37 0.730
## 6 Afghanistan 1957 34 0.529
## 7 Afghanistan 1959 54 0.611
## 8 Afghanistan 1961 76 0.605
## 9 Afghanistan 1963 32 0.781
## 10 Afghanistan 1965 40 0.85
## # … with 4,734 more rows
# Perform a linear regression on each item in the data column
by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 × 3
## country data model
## <chr> <list> <list>
## 1 Afghanistan <grouped_df [34 × 3]> <lm>
## 2 Argentina <grouped_df [34 × 3]> <lm>
## 3 Australia <grouped_df [34 × 3]> <lm>
## 4 Belarus <grouped_df [34 × 3]> <lm>
## 5 Belgium <grouped_df [34 × 3]> <lm>
## 6 Bolivia <grouped_df [34 × 3]> <lm>
## 7 Brazil <grouped_df [34 × 3]> <lm>
## 8 Canada <grouped_df [34 × 3]> <lm>
## 9 Chile <grouped_df [34 × 3]> <lm>
## 10 Colombia <grouped_df [34 × 3]> <lm>
## # … with 190 more rows
# Add another mutate that applies tidy() to each model
by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))) %>%
mutate(tidied=map(model, tidy))
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## # A tibble: 200 × 4
## country data model tidied
## <chr> <list> <list> <list>
## 1 Afghanistan <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 2 Argentina <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 3 Australia <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 4 Belarus <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 5 Belgium <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 6 Bolivia <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 7 Brazil <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 8 Canada <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 9 Chile <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## 10 Colombia <grouped_df [34 × 3]> <lm> <tibble [2 × 5]>
## # … with 190 more rows
# Add one more step that unnests the tidied column
country_coefficients <- by_year_country %>%
nest(-country) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
# Print the resulting country_coefficients variable
country_coefficients
## # A tibble: 400 × 8
## country data model term estimate std.error statistic p.value
## <chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan <grouped_d… <lm> (Inter… -1.11e+1 1.47 -7.52 1.44e-8
## 2 Afghanistan <grouped_d… <lm> year 6.01e-3 0.000743 8.09 3.06e-9
## 3 Argentina <grouped_d… <lm> (Inter… -9.46e+0 2.10 -4.50 8.32e-5
## 4 Argentina <grouped_d… <lm> year 5.15e-3 0.00106 4.85 3.05e-5
## 5 Australia <grouped_d… <lm> (Inter… -4.55e+0 2.15 -2.12 4.22e-2
## 6 Australia <grouped_d… <lm> year 2.57e-3 0.00108 2.37 2.42e-2
## 7 Belarus <grouped_d… <lm> (Inter… -7.00e+0 1.50 -4.66 5.33e-5
## 8 Belarus <grouped_d… <lm> year 3.91e-3 0.000759 5.15 1.28e-5
## 9 Belgium <grouped_d… <lm> (Inter… -5.85e+0 1.52 -3.86 5.22e-4
## 10 Belgium <grouped_d… <lm> year 3.20e-3 0.000765 4.19 2.07e-4
## # … with 390 more rows
Not all slopes are significant, and you can use the p-value to guess which are and which are not. However, when you have lots of p-values, like one for each country, you run into the problem of multiple hypothesis testing, where you have to set a stricter threshold. The p.adjust() function is a simple way to correct for this, where p.adjust(p.value) on a vector of p-values returns a set that you can trust.
# Filter for only the slope terms
country_coefficients %>%
filter(term == "year")
## # A tibble: 200 × 8
## country data model term estimate std.error statistic p.value
## <chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan <grouped_df … <lm> year 0.00601 0.000743 8.09 3.06e-9
## 2 Argentina <grouped_df … <lm> year 0.00515 0.00106 4.85 3.05e-5
## 3 Australia <grouped_df … <lm> year 0.00257 0.00108 2.37 2.42e-2
## 4 Belarus <grouped_df … <lm> year 0.00391 0.000759 5.15 1.28e-5
## 5 Belgium <grouped_df … <lm> year 0.00320 0.000765 4.19 2.07e-4
## 6 Bolivia <grouped_df … <lm> year 0.00580 0.000966 6.01 1.06e-6
## 7 Brazil <grouped_df … <lm> year 0.00611 0.000817 7.48 1.64e-8
## 8 Canada <grouped_df … <lm> year 0.00152 0.000955 1.59 1.22e-1
## 9 Chile <grouped_df … <lm> year 0.00678 0.000822 8.24 2.05e-9
## 10 Colombia <grouped_df … <lm> year 0.00616 0.000965 6.38 3.58e-7
## # … with 190 more rows
# Filter for only the slope terms
slope_terms <- country_coefficients %>%
filter(term == "year")
# Add p.adjusted column, then filter
slope_terms %>% mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < 0.05)
## # A tibble: 61 × 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <lis> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan <gro… <lm> year 0.00601 0.000743 8.09 3.06e-9 5.95e-7
## 2 Argentina <gro… <lm> year 0.00515 0.00106 4.85 3.05e-5 4.81e-3
## 3 Belarus <gro… <lm> year 0.00391 0.000759 5.15 1.28e-5 2.08e-3
## 4 Belgium <gro… <lm> year 0.00320 0.000765 4.19 2.07e-4 3.01e-2
## 5 Bolivia <gro… <lm> year 0.00580 0.000966 6.01 1.06e-6 1.88e-4
## 6 Brazil <gro… <lm> year 0.00611 0.000817 7.48 1.64e-8 3.12e-6
## 7 Chile <gro… <lm> year 0.00678 0.000822 8.24 2.05e-9 3.99e-7
## 8 Colombia <gro… <lm> year 0.00616 0.000965 6.38 3.58e-7 6.56e-5
## 9 Costa Rica <gro… <lm> year 0.00654 0.000812 8.05 3.39e-9 6.54e-7
## 10 Cuba <gro… <lm> year 0.00461 0.000721 6.40 3.43e-7 6.31e-5
## # … with 51 more rows
# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < .05)
# Sort for the countries increasing most quickly
filtered_countries %>% arrange(estimate )
## # A tibble: 61 × 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Ko… <group… <lm> year -0.00921 0.00155 -5.96 1.39e-4 0.0209
## 2 Israel <group… <lm> year -0.00685 0.00117 -5.85 1.89e-6 0.000331
## 3 United S… <group… <lm> year -0.00624 0.000928 -6.72 1.37e-7 0.0000254
## 4 Belgium <group… <lm> year 0.00320 0.000765 4.19 2.07e-4 0.0301
## 5 Guinea <group… <lm> year 0.00362 0.000833 4.35 1.87e-4 0.0275
## 6 Morocco <group… <lm> year 0.00380 0.000860 4.42 1.46e-4 0.0218
## 7 Belarus <group… <lm> year 0.00391 0.000759 5.15 1.28e-5 0.00208
## 8 Iran <group… <lm> year 0.00391 0.000856 4.57 6.91e-5 0.0107
## 9 Congo - … <group… <lm> year 0.00397 0.000922 4.30 2.27e-4 0.0326
## 10 Sudan <group… <lm> year 0.00399 0.000961 4.15 2.98e-4 0.0420
## # … with 51 more rows
# Sort for the countries decreasing most quickly
filtered_countries %>% arrange(desc(estimate ))
## # A tibble: 61 × 9
## country data model term estimate std.error statistic p.value p.adjusted
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Af… <grou… <lm> year 0.0119 0.00140 8.47 1.60e- 8 3.05e-6
## 2 Kazakhst… <grou… <lm> year 0.0110 0.00195 5.62 3.24e- 4 4.51e-2
## 3 Yemen Ar… <grou… <lm> year 0.0109 0.00159 6.84 1.20e- 6 2.11e-4
## 4 Kyrgyzst… <grou… <lm> year 0.00973 0.000988 9.84 2.38e- 5 3.78e-3
## 5 Malawi <grou… <lm> year 0.00908 0.00181 5.02 4.48e- 5 7.03e-3
## 6 Dominica… <grou… <lm> year 0.00806 0.000914 8.81 5.96e-10 1.17e-7
## 7 Portugal <grou… <lm> year 0.00802 0.00171 4.68 7.13e- 5 1.10e-2
## 8 Honduras <grou… <lm> year 0.00772 0.000921 8.38 1.43e- 9 2.81e-7
## 9 Peru <grou… <lm> year 0.00730 0.000976 7.48 1.65e- 8 3.12e-6
## 10 Nicaragua <grou… <lm> year 0.00708 0.00107 6.60 1.92e- 7 3.55e-5
## # … with 51 more rows
In this chapter, you’ll learn to combine multiple related datasets, such as incorporating information about each resolution’s topic into your vote analysis. You’ll also learn how to turn untidy data into tidy data, and see how tidy data can guide your exploration of topics and countries over time.
# Print the votes_processed dataset
votes_processed
## # A tibble: 353,547 × 6
## rcid session vote ccode year country
## <dbl> <dbl> <dbl> <int> <dbl> <chr>
## 1 46 2 1 2 1947 United States
## 2 46 2 1 20 1947 Canada
## 3 46 2 1 40 1947 Cuba
## 4 46 2 1 41 1947 Haiti
## 5 46 2 1 42 1947 Dominican Republic
## 6 46 2 1 70 1947 Mexico
## 7 46 2 1 90 1947 Guatemala
## 8 46 2 1 91 1947 Honduras
## 9 46 2 1 92 1947 El Salvador
## 10 46 2 1 93 1947 Nicaragua
## # … with 353,537 more rows
# Print the descriptions dataset
descriptions
## # A tibble: 2,589 × 10
## rcid session date unres me nu di hr co ec
## <dbl> <dbl> <dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 46 2 1947-09-04 00:00:00 R/2/299 0 0 0 0 0 0
## 2 47 2 1947-10-05 00:00:00 R/2/355 0 0 0 1 0 0
## 3 48 2 1947-10-06 00:00:00 R/2/461 0 0 0 0 0 0
## 4 49 2 1947-10-06 00:00:00 R/2/463 0 0 0 0 0 0
## 5 50 2 1947-10-06 00:00:00 R/2/465 0 0 0 0 0 0
## 6 51 2 1947-10-02 00:00:00 R/2/561 0 0 0 0 1 0
## 7 52 2 1947-11-06 00:00:00 R/2/650 0 0 0 0 1 0
## 8 53 2 1947-11-06 00:00:00 R/2/651 0 0 0 0 1 0
## 9 54 2 1947-11-06 00:00:00 R/2/651 0 0 0 0 1 0
## 10 55 2 1947-11-06 00:00:00 R/2/667 0 0 0 0 1 0
## # … with 2,579 more rows
# Join them together based on the "rcid" and "session" columns
votes_joined <- votes_processed %>% inner_join(descriptions, by = c("rcid", "session"))
# Filter for votes related to colonialism
votes_joined %>% filter(co == 1)
## # A tibble: 60,962 × 14
## rcid session vote ccode year country date unres me nu
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <dbl> <dbl>
## 1 51 2 3 2 1947 United… 1947-10-02 00:00:00 R/2/… 0 0
## 2 51 2 3 20 1947 Canada 1947-10-02 00:00:00 R/2/… 0 0
## 3 51 2 2 40 1947 Cuba 1947-10-02 00:00:00 R/2/… 0 0
## 4 51 2 1 41 1947 Haiti 1947-10-02 00:00:00 R/2/… 0 0
## 5 51 2 3 42 1947 Domini… 1947-10-02 00:00:00 R/2/… 0 0
## 6 51 2 2 70 1947 Mexico 1947-10-02 00:00:00 R/2/… 0 0
## 7 51 2 2 90 1947 Guatem… 1947-10-02 00:00:00 R/2/… 0 0
## 8 51 2 2 92 1947 El Sal… 1947-10-02 00:00:00 R/2/… 0 0
## 9 51 2 3 93 1947 Nicara… 1947-10-02 00:00:00 R/2/… 0 0
## 10 51 2 2 95 1947 Panama 1947-10-02 00:00:00 R/2/… 0 0
## # … with 60,952 more rows, and 4 more variables: di <dbl>, hr <dbl>, co <dbl>,
## # ec <dbl>
# Filter, then summarize by year: US_co_by_year
US_co_by_year <- votes_joined %>%
filter(country == "United States", co == 1) %>%
group_by(year) %>%
summarize(percent_yes = mean(vote == 1))
# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(year, percent_yes)) +
geom_line()
In order to represent the joined vote-topic data in a tidy form so we can analyze and graph by topic, we need to transform the data so that each row has one combination of country-vote-topic. This will change the data from having six columns (me, nu, di, hr, co, ec) to having two columns (topic and has_topic).
# Gather the six me/nu/di/hr/co/ec columns
votes_joined %>%
gather(topic, has_topic, me:ec)
## # A tibble: 2,121,282 × 10
## rcid session vote ccode year country date unres topic
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <chr>
## 1 46 2 1 2 1947 United States 1947-09-04 00:00:00 R/2/… me
## 2 46 2 1 20 1947 Canada 1947-09-04 00:00:00 R/2/… me
## 3 46 2 1 40 1947 Cuba 1947-09-04 00:00:00 R/2/… me
## 4 46 2 1 41 1947 Haiti 1947-09-04 00:00:00 R/2/… me
## 5 46 2 1 42 1947 Dominican Re… 1947-09-04 00:00:00 R/2/… me
## 6 46 2 1 70 1947 Mexico 1947-09-04 00:00:00 R/2/… me
## 7 46 2 1 90 1947 Guatemala 1947-09-04 00:00:00 R/2/… me
## 8 46 2 1 91 1947 Honduras 1947-09-04 00:00:00 R/2/… me
## 9 46 2 1 92 1947 El Salvador 1947-09-04 00:00:00 R/2/… me
## 10 46 2 1 93 1947 Nicaragua 1947-09-04 00:00:00 R/2/… me
## # … with 2,121,272 more rows, and 1 more variable: has_topic <dbl>
# Perform gather again, then filter
votes_gathered <- votes_joined %>%
gather(topic, has_topic, me:ec) %>%
filter(has_topic == 1)
# Replace the two-letter codes in topic: votes_tidied
votes_tidied <- votes_gathered %>%
mutate(topic = recode(topic,
me = "Palestinian conflict",
nu = "Nuclear weapons and nuclear material",
di = "Arms control and disarmament",
hr = "Human rights",
co = "Colonialism",
ec = "Economic development"))
# Print votes_tidied
votes_tidied
## # A tibble: 350,032 × 10
## rcid session vote ccode year country date unres topic
## <dbl> <dbl> <dbl> <int> <dbl> <chr> <dttm> <chr> <chr>
## 1 77 2 1 2 1947 United St… 1947-11-06 00:00:00 R/2/… Palesti…
## 2 77 2 1 20 1947 Canada 1947-11-06 00:00:00 R/2/… Palesti…
## 3 77 2 3 40 1947 Cuba 1947-11-06 00:00:00 R/2/… Palesti…
## 4 77 2 1 41 1947 Haiti 1947-11-06 00:00:00 R/2/… Palesti…
## 5 77 2 1 42 1947 Dominican… 1947-11-06 00:00:00 R/2/… Palesti…
## 6 77 2 2 70 1947 Mexico 1947-11-06 00:00:00 R/2/… Palesti…
## 7 77 2 1 90 1947 Guatemala 1947-11-06 00:00:00 R/2/… Palesti…
## 8 77 2 2 91 1947 Honduras 1947-11-06 00:00:00 R/2/… Palesti…
## 9 77 2 2 92 1947 El Salvad… 1947-11-06 00:00:00 R/2/… Palesti…
## 10 77 2 1 93 1947 Nicaragua 1947-11-06 00:00:00 R/2/… Palesti…
## # … with 350,022 more rows, and 1 more variable: has_topic <dbl>
# Summarize the percentage "yes" per country-year-topic
by_country_year_topic <- votes_tidied %>% group_by(country, year, topic) %>%
summarize(total = n(), percent_yes = mean(vote==1)) %>%
ungroup()
## `summarise()` has grouped output by 'country', 'year'. You can override using
## the `.groups` argument.
# Print by_country_year_topic
by_country_year_topic
## # A tibble: 26,968 × 5
## country year topic total percent_yes
## <chr> <dbl> <chr> <int> <dbl>
## 1 Afghanistan 1947 Colonialism 8 0.5
## 2 Afghanistan 1947 Economic development 1 0
## 3 Afghanistan 1947 Human rights 1 0
## 4 Afghanistan 1947 Palestinian conflict 6 0
## 5 Afghanistan 1949 Arms control and disarmament 3 0
## 6 Afghanistan 1949 Colonialism 22 0.864
## 7 Afghanistan 1949 Economic development 1 1
## 8 Afghanistan 1949 Human rights 3 0
## 9 Afghanistan 1949 Nuclear weapons and nuclear material 3 0
## 10 Afghanistan 1949 Palestinian conflict 11 0.818
## # … with 26,958 more rows
# Filter by_country_year_topic for just the US
US_by_country_year_topic <- by_country_year_topic %>% filter(country == "United States")
# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(year, percent_yes)) +
geom_line() + facet_wrap(~ topic)
# Print by_country_year_topic
by_country_year_topic
## # A tibble: 26,968 × 5
## country year topic total percent_yes
## <chr> <dbl> <chr> <int> <dbl>
## 1 Afghanistan 1947 Colonialism 8 0.5
## 2 Afghanistan 1947 Economic development 1 0
## 3 Afghanistan 1947 Human rights 1 0
## 4 Afghanistan 1947 Palestinian conflict 6 0
## 5 Afghanistan 1949 Arms control and disarmament 3 0
## 6 Afghanistan 1949 Colonialism 22 0.864
## 7 Afghanistan 1949 Economic development 1 1
## 8 Afghanistan 1949 Human rights 3 0
## 9 Afghanistan 1949 Nuclear weapons and nuclear material 3 0
## 10 Afghanistan 1949 Palestinian conflict 11 0.818
## # … with 26,958 more rows
# Fit model on the by_country_year_topic dataset
country_topic_coefficients <- by_country_year_topic %>%
nest(-country, -topic) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, total, percent_yes)`?
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
# Print country_topic_coefficients
country_topic_coefficients
## # A tibble: 2,384 × 9
## country topic data model term estimate std.error statistic p.value
## <chr> <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Colonia… <tibbl… <lm> (Int… -9.20e+0 1.96 -4.70 4.76e-5
## 2 Afghanistan Colonia… <tibbl… <lm> year 5.11e-3 0.000989 5.17 1.23e-5
## 3 Afghanistan Economi… <tibbl… <lm> (Int… -1.15e+1 3.62 -3.17 3.49e-3
## 4 Afghanistan Economi… <tibbl… <lm> year 6.24e-3 0.00183 3.42 1.85e-3
## 5 Afghanistan Human r… <tibbl… <lm> (Int… -7.27e+0 4.37 -1.66 1.06e-1
## 6 Afghanistan Human r… <tibbl… <lm> year 4.08e-3 0.00221 1.85 7.43e-2
## 7 Afghanistan Palesti… <tibbl… <lm> (Int… -1.33e+1 3.57 -3.73 8.66e-4
## 8 Afghanistan Palesti… <tibbl… <lm> year 7.17e-3 0.00180 3.98 4.42e-4
## 9 Afghanistan Arms co… <tibbl… <lm> (Int… -1.38e+1 4.13 -3.33 2.53e-3
## 10 Afghanistan Arms co… <tibbl… <lm> year 7.37e-3 0.00208 3.54 1.49e-3
## # … with 2,374 more rows
# Create country_topic_filtered
country_topic_filtered <- country_topic_coefficients %>% filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>% filter(p.adjusted < 0.05)
In the last exercise, you found that over its history, Vanuatu (an island nation in the Pacific Ocean) sharply changed its pattern of voting on the topic of Palestinian conflict.
# Create vanuatu_by_country_year_topic
vanuatu_by_country_year_topic <- by_country_year_topic %>% filter(country == "Vanuatu")
# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(year, percent_yes)) +
geom_line() +
facet_wrap(~ topic)
The End.
Thanks DataCamp
- My Favorite Team -