Section 3 introduces you to summarizing with dplyr.
After completing Section 3, you will be able to summarize data with summarize(), group data with group_by(), access values from a pipeline with the dot placeholder, and sort data frames with arrange().
There is 1 assignment that uses the DataCamp platform for you to practice your coding skills.
We encourage you to use R to interactively test out your answers and further your learning.
Transcript
An important part of exploratory data analysis is summarizing data. We’ve already learned about the average and the standard deviation, two summary statistics that provide all the information needed to summarize data that is normally distributed. We also learned that better summaries can be achieved by splitting data into groups before using the normal approximation. For example, in our heights dataset, we described the heights of men and women separately.
In this section, we cover two new dplyr verbs that make these computations easier, summarize and group_by. We will also learn to access resulting values using what we call the dot placeholder. And finally, we will learn to use arrange, which helps us examine data after sorting.
We get started by loading the tidyverse library. Once the library is loaded, we’re ready to use summarize. The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code.
We start with a simple example based on our heights dataset, which we load using this code. In our first example, we’re going to compute the average and the standard deviation for males. So we write this code: we first filter to access only the male data, and then we use the function summarize. We say average equals the mean of height and standard_deviation equals the sd of height. This gives us the average and the standard deviation.
In this code, we start with our original data table, filter to keep only males, and then produce a new summary table with just the average and the standard deviation of height. Note that we get to pick the names of the columns of the resulting table. For example, in the code we just wrote we decided to use average and standard_deviation, but we could have used any other names just the same.
Because the resulting table stored in s is a data frame, we can access its components with the dollar sign accessor; in this case the components are numeric. Here’s what we get: if we type s$average, we get the average we just computed, and if we type s$standard_deviation, we get the standard deviation we just computed.
As with most other dplyr functions, summarize is aware of the variable names and we can use them directly. So inside the call to summarize, we simply write mean(height) without the data object name. This accesses the column with that name and computes the average of the corresponding numeric vector.
Note that we can compute any summary that operates on vectors and returns a single value. For example, we can write code that computes the median, the min, and the max, like this. We follow the same code, but now inside the summarize function we compute the median, the min, and the max of the vector defined by the column named height. And we can see the results.
Note that in base R we can actually obtain the three values with just one line of code using the quantile function. For example, if we type quantile(x, c(0, 0.5, 1)), passing the quantiles we want computed, it returns the three numbers we want.
However, if we attempt to use this function inside of summarize, we get an error: “Expecting result of length one, got three.” With the function summarize, we can only call functions that return a single value.
Key points
summarize() from the dplyr/tidyverse package computes summary statistics from the data frame. It returns a data frame whose column names are defined within the function call.
summarize() can compute any summary function that operates on vectors and returns a single value, but it cannot operate on functions that return multiple values.
Like most dplyr functions, summarize() is aware of variable names within data frames and can use them directly.
Code
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dslabs)
data(heights)
# compute average and standard deviation for males
s <- heights %>%
    filter(sex == "Male") %>%
    summarize(average = mean(height), standard_deviation = sd(height))
# access average and standard deviation from summary table
s$average
## [1] 69.31475
s$standard_deviation
## [1] 3.611024
# compute median, min and max
heights %>%
    filter(sex == "Male") %>%
    summarize(median = median(height),
              minimum = min(height),
              maximum = max(height))
##   median minimum  maximum
## 1     69      50 82.67717
# alternative way to get min, median, max in base R
quantile(heights$height, c(0, 0.5, 1))
## 0% 50% 100%
## 50.00000 68.50000 82.67717
# generates an error: summarize can only take functions that return a single value
# heights %>%
#     filter(sex == "Male") %>%
#     summarize(range = quantile(height, c(0, 0.5, 1)))
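If we do want several quantiles from a single summarize() call, one workaround (a sketch, not part of the original course code) is to request each quantile separately, since quantile() with a single probability returns a single value:
# compute each quantile in its own column so every summary returns one value
heights %>%
    filter(sex == "Male") %>%
    summarize(q0 = quantile(height, 0),
              q50 = quantile(height, 0.5),
              q100 = quantile(height, 1))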
Transcript
In this video, we’re going to learn how to make dplyr functions return vectors as opposed to data frames. This can be useful sometimes. We’re going to use the US murder data as an example. Remember our data table includes total murders and population size for each state, and we have already used dplyr to add a murder rate column. We did it like this.
However, note that the US murder rate is not the average of the state murder rates. If you do that, you get 2.78. This is not the US murder rate, which is closer to 3. This is because in this computation we’re counting the small states just the same as the large states, and when we compute the average US murder rate, it needs to take into account bigger states more than smaller states. The US murder rate is proportional to the total US murders divided by the total US population.
So the correct computation is as follows. We can use summarize, and now we compute the sum of the total murders, and we divide by the sum of population, and this gives us the correct US murder rate, which is 3.03.
Now the us_murder_rate object defined above represents just one number, yet it is stored as a data frame. Most dplyr functions, including summarize, always return data frames. This can be problematic: if we want to use the result with a function that requires a numeric value, we won’t be able to do it.
Here we show a useful trick for accessing values stored in data that is being piped in with the %>% character. We’re going to access them using something called the dot.
The best way to understand what we mean by this is to look at code. Note that if we write this piece of code, we get the numeric 3.03. This returns the value in the rate column of the us_murder_rate table, making it equivalent to typing us_murder_rate$rate.
To understand this line, you just need to think of the dot as a placeholder for the data that is being passed through the pipe.
Because this data object is a data frame, we can access its column using dot dollar sign and then the column name.
To get a number from the original data table with one line of code, we can type this. Notice that it’s the same code, but at the end we pipe into .$rate, and this returns the numeric value, not the data table.
Later we will see other instances in which using the dot is useful, but for now, we’ll only use it to produce numeric vectors from pipelines constructed with dplyr.
Key points
The dot operator allows you to access values stored in data that is being piped in using the %>% character. The dot is a placeholder for the data being passed in through the pipe.
The dot operator allows dplyr functions to return single vectors or numbers instead of only data frames.
us_murder_rate %>% .$rate is equivalent to us_murder_rate$rate.
Note that an equivalent way to extract a single column using the pipe is us_murder_rate %>% pull(rate). The pull() function will be used in later course material.
Code
library(tidyverse)
library(dslabs)
data(murders)
murders <- murders %>% mutate(murder_rate = total/population*100000)
summarize(murders, mean(murder_rate))
## mean(murder_rate)
## 1 2.779125
# calculate US murder rate, generating a data frame
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
## rate
## 1 3.034555
# extract the numeric US murder rate with the dot operator
us_murder_rate %>% .$rate
## [1] 3.034555
# calculate and extract the murder rate with one pipe
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000) %>%
    .$rate
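As noted in the key points, pull() provides an equivalent way to extract the column; here is a minimal sketch (not part of the original course code):
# equivalent extraction using pull(), mentioned in the key points above
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000) %>%
    pull(rate)
us_murder_rate    # a single numeric value, approximately 3.03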
Transcript
A very common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation of men’s and women’s heights separately. The group_by function helps us do this.
Let’s look at an example. If we take heights and pipe it into the group_by function, using sex as the grouping variable, we see this table. Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, will behave differently when acting on this object.
Conceptually, you can think of this table as many tables, with the same columns but not necessarily the same number of rows, stacked together into one object.
So note what happens when we summarize the data after grouping it. We write this piece of code, where we first group by sex and then summarize, asking for the mean and the standard deviation. The result is a table with the average and the standard deviation for females and males separately.
For another example, let’s compute the median murder rate in the four regions of the country. We can write this code that looks a lot like the previous one. We type murders. We group it by region. And then we summarize it, asking to get back the median murder rate.
These are very common examples, and you will see many of these as we go along looking at different data examples.
Key points
The group_by() function from dplyr converts a data frame to a grouped data frame, creating groups using one or more variables.
summarize() and some other dplyr functions will behave differently on grouped data frames.
Using summarize() on a grouped data frame computes the summary statistics for each of the separate groups.
Code
# libraries and data
library(tidyverse)
library(dslabs)
data(heights)
data(murders)
# compute separate average and standard deviation for male/female heights
heights %>%
    group_by(sex) %>%
    summarize(average = mean(height), standard_deviation = sd(height))
## # A tibble: 2 x 3
##   sex    average standard_deviation
##   <fct>    <dbl>              <dbl>
## 1 Female    64.9               3.76
## 2 Male      69.3               3.61
# compute median murder rate in 4 regions of country
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)
murders %>%
    group_by(region) %>%
    summarize(median_rate = median(murder_rate))
## # A tibble: 4 x 2
##   region        median_rate
##   <fct>               <dbl>
## 1 Northeast            1.80
## 2 South                3.40
## 3 North Central        1.97
## 4 West                 1.29
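To see that group_by() really does return a different kind of object, a quick check (a sketch, not part of the original course code):
# group_by() returns a grouped data frame (class "grouped_df")
heights %>% group_by(sex) %>% class()
# summarize() then computes one row per group, for example a count per sex
heights %>% group_by(sex) %>% summarize(n = n())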
Transcript
When examining a data set, it is often convenient to sort the table by different columns. We already know about the order and sort functions, but for ordering entire tables, the dplyr function arrange is very useful.
For example, here, we order the states by their population size. All we have to do is type murders, and then pipe it to the arrange function. And tell it to sort it by population. Here are the first 6 states. We can see Wyoming is the smallest. DC is the second smallest, et cetera. Note that we get to decide which column to sort by. To see the states ordered by murder rate instead of population, we write almost the exact same code. Except now instead of population, we type murder rate. And we can see the 6 states with the lowest murder rates: Vermont, New Hampshire, Hawaii, et cetera.
Note that the default behavior is to sort in ascending order. In dplyr, the function desc, which stands for descending, transforms a vector to be in descending order. So if we instead type the following code, we get the states ordered by murder rate from highest to lowest. Note that now we’re using desc(murder_rate) instead of murder_rate. And we can see that the states with the highest murder rates are DC, Louisiana, Missouri, et cetera.
We can also do nested sorting. Let’s say we’re ordering by a column and there are ties. We can break the ties with a second column, or a third, or a fourth. In this example, we order by region, then within each region we order by murder rate. You can see that the first few states are all in the Northeast, because we’re ordering by region, and then within the Northeast we’re ordering from lowest to highest murder rate.
Another useful function is the top_n function. In the code we’ve just seen, we have used the function head to avoid having the page fill with the entire data table. It shows us the first 6 rows. Now, if we want to see a larger proportion of the data– say the top 10– we can use the top_n function.
Here, using this code, we will show the 10 states with the highest murder rates. You can see it includes Arizona, Delaware, et cetera. However, they are not ordered. If we want them ordered, we have to use the arrange function. So here’s how we do it. We type murders and pipe it into arrange, using desc of murder rate to order the data. And then we pipe it into top_n. Now we don’t have to tell it what column to use, because top_n(10) will show us the top 10 rows of the data that has already been ordered.
Key points
The arrange() function from dplyr sorts a data frame by a given column.
By default, arrange() sorts in ascending order (lowest to highest). To instead sort in descending order, use the function desc() inside of arrange().
You can arrange() by multiple levels: within equivalent values of the first level, observations are sorted by the second level, and so on.
The top_n() function shows the top results ranked by a given variable, but the results are not ordered. You can combine top_n() with arrange() to return the top results in order.
Code
# libraries and data
library(tidyverse)
library(dslabs)
data(murders)
# set up murders object
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)
# arrange by population column, smallest to largest
murders %>% arrange(population) %>% head()
## state abb region population total murder_rate
## 1 Wyoming WY West 563626 5 0.8871131
## 2 District of Columbia DC South 601723 99 16.4527532
## 3 Vermont VT Northeast 625741 2 0.3196211
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Alaska AK West 710231 19 2.6751860
## 6 South Dakota SD North Central 814180 8 0.9825837
# arrange by murder rate, smallest to largest
murders %>% arrange(murder_rate) %>% head()
## state abb region population total murder_rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Hawaii HI West 1360301 7 0.5145920
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Iowa IA North Central 3046355 21 0.6893484
## 6 Idaho ID West 1567582 12 0.7655102
# arrange by murder rate in descending order
murders %>% arrange(desc(murder_rate)) %>% head()
## state abb region population total murder_rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
# arrange by region alphabetically, then by murder rate within each region
murders %>% arrange(region, murder_rate) %>% head()
## state abb region population total murder_rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Maine ME Northeast 1328361 11 0.8280881
## 4 Rhode Island RI Northeast 1052567 16 1.5200933
## 5 Massachusetts MA Northeast 6547629 118 1.8021791
## 6 New York NY Northeast 19378102 517 2.6679599
# show the top 10 states with highest murder rate, not ordered by rate
murders %>% top_n(10, murder_rate)
## state abb region population total murder_rate
## 1 Arizona AZ West 6392017 232 3.629527
## 2 Delaware DE South 897934 38 4.231937
## 3 District of Columbia DC South 601723 99 16.452753
## 4 Georgia GA South 9920000 376 3.790323
## 5 Louisiana LA South 4533372 351 7.742581
## 6 Maryland MD South 5773552 293 5.074866
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Missouri MO North Central 5988927 321 5.359892
## 10 South Carolina SC South 4625364 207 4.475323
# show the top 10 states with highest murder rate, ordered by rate
murders %>% arrange(desc(murder_rate)) %>% top_n(10)
## Selecting by murder_rate
## state abb region population total murder_rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Georgia GA South 9920000 376 3.790323
## 10 Arizona AZ West 6392017 232 3.629527
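As a small variation on the nested sort above (a sketch, not part of the original course code), desc() can be combined with a secondary sort column:
# order by region, then from highest to lowest murder rate within each region
murders %>% arrange(region, desc(murder_rate)) %>% head()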