Load Libraries

library(nycflights13)
library(tidyverse)
library(openintro)

In this lecture, we will continue studying summary functions used with summarise(), then learn how to use the pipe operator %>% to make the code for a sequence of operations much easier to write and clearer to read.


1. Useful summary functions

R provides many other summary functions that can be used with summarise() to summarize data. Below are some of the most useful ones.

mean(), median()

  • Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.
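
As a quick illustration with a made-up vector, the median is robust to extreme values while the mean is not:

x <- c(1, 2, 3, 4, 100)  # hypothetical vector with one extreme value
mean(x)    # 22, pulled up by the outlier
median(x)  # 3, unaffected by the outlier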

It’s sometimes useful to combine aggregation with logical subsetting. For example,

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
  avg_delay1 = mean(arr_delay),
  avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # … with 355 more rows


sum() and n()

The sum() function, combined with the counting function n(), is commonly used to compute relative frequencies.

my_summary <- summarise(group_by(flights, origin), n = n())  # n() counts how many rows there are for each group
my_summary
## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662
my_summary <- mutate(my_summary, prop = n/sum(n))  # Use `mutate()` to add a column of "proportion"
my_summary
## # A tibble: 3 × 3
##   origin      n  prop
##   <chr>   <int> <dbl>
## 1 EWR    120835 0.359
## 2 JFK    111279 0.330
## 3 LGA    104662 0.311
ggplot(my_summary) +
  stat_summary(mapping = aes(x = origin, y = prop), geom = "bar") +
  labs(title = "Total Number of Flights from NYC in 2013", x = "Airport Code", y = "Relative Frequency") +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))
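
Since my_summary already contains one prop value per origin, a more direct alternative (purely a stylistic choice) is geom_col(), which draws bars at the given y values without computing any summary statistic:

ggplot(my_summary) + geom_col(mapping = aes(x = origin, y = prop))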


Measures of spread: sd(x), IQR(x), mad(x).

The root mean squared deviation, or standard deviation sd(x) for short, is the standard measure of spread. The interquartile range IQR(x) and the median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
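
To see this robustness with a quick (made-up) example, note how much more a single planted outlier inflates sd() than IQR() or mad():

x <- c(rnorm(100), 50)  # hypothetical sample: 100 standard normal values plus one outlier at 50
sd(x)   # heavily inflated by the outlier
IQR(x)  # barely affected
mad(x)  # barely affected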

For example, let’s find the destination with the highest variation in arrival delay:

not_canceled <- filter(flights, !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_summary <- summarise(my_data, sd = sd(arr_delay))
arrange(my_summary, desc(sd))
## # A tibble: 104 × 2
##    dest     sd
##    <chr> <dbl>
##  1 HNL    60.8
##  2 TUL    60.3
##  3 TVC    59.3
##  4 CAK    58.3
##  5 TYS    58.0
##  6 SAT    57.8
##  7 DSM    56.9
##  8 MSN    56.8
##  9 BHM    56.2
## 10 CVG    55.1
## # … with 94 more rows


Measures of rank: min(x), quantile(x, 0.25), max(x).

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
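
As a quick illustration, reusing not_canceled from above, we can compute the quartiles of arrival delay directly:

quantile(not_canceled$arr_delay, c(0.25, 0.5, 0.75))  # lower quartile, median, upper quartile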

For example, if we want to know when the first and last flights leave each day, we can do the following:

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data, 
  first = min(dep_time),
  last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # … with 355 more rows


first(), nth(), last()

  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (e.g., you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
summarise(my_data,
  first_dep = first(dep_time), 
  last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # … with 355 more rows


n(), sum(!is.na()), n_distinct(), count()

  • Counts: n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
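
For example, dep_time is missing for canceled flights, so we can compare the number of scheduled flights with the number that actually departed each day:

summarise(group_by(flights, year, month, day),
  scheduled = n(),                  # all flights in the group
  departed = sum(!is.na(dep_time))  # flights with a recorded departure time
)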

Counting is used so frequently in data analysis that there is a separate function count() to do the job. count(<data_name>, <variable_names>) is equivalent to first grouping by those variables and then counting the number of rows in each group.

count(flights, origin)
## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662
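
count() also has a sort argument that orders the result by the counts, which is convenient for finding the most common values:

count(flights, dest, sort = TRUE)  # destinations ordered from most to fewest flights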


Which destinations have the most carriers?

my_data <- group_by(not_canceled, dest) 
my_sum <- summarise(my_data, carriers = n_distinct(carrier)) 
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows


Lab Exercise

Find which origin has the most destinations.


2. The Pipe Operator %>%

In most examples so far, we need to do a sequence of operations on the data set to fulfill our task. For example:

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, dest) 
my_sum <- summarise(my_data, carriers = n_distinct(carrier)) 
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

As we see, we have to assign the transformed data to a new or existing object at every step to save the intermediate results. When there are many steps, this becomes quite cumbersome and inconvenient, and the code is harder for readers to follow.

Instead, we can use the pipe operator %>% from the package magrittr, which is loaded automatically with tidyverse, so we don’t have to load it explicitly.
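
The rule is simple: x %>% f(y) is evaluated as f(x, y); that is, the left-hand side is passed as the first argument of the function on the right. A minimal illustration:

c(3, 1, 2) %>% sort()           # same as sort(c(3, 1, 2))
flights %>% filter(month == 1)  # same as filter(flights, month == 1)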

The code after using the pipe for the previous example looks like this:

my_data <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers)) %>%
  print()
## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

So the template for the pipe is:

<new_data_name> <- <original_data_name> %>%
  operation1() %>%
  operation2() %>%
  ...

Most importantly, in every operation function (basically all the functions we have introduced so far from dplyr), we can now omit the data set name as the first argument and directly write the operations; the pipe supplies the data automatically.


Lab Exercise

Rewrite the following analysis code using the pipe operator:

my_data <- filter(diamonds, carat < 2)
my_data <- group_by(my_data, color)
my_summary <- summarise(my_data, mean_carat = mean(carat), mean_price = mean(price))
my_summary <- mutate(my_summary, mean_unit_price = mean_price/mean_carat)
print(my_summary)
## # A tibble: 7 × 4
##   color mean_carat mean_price mean_unit_price
##   <ord>      <dbl>      <dbl>           <dbl>
## 1 D          0.646      3063.           4743.
## 2 E          0.643      2944.           4577.
## 3 F          0.716      3547.           4952.
## 4 G          0.737      3720.           5047.
## 5 H          0.828      3765.           4548.
## 6 I          0.884      3814.           4314.
## 7 J          0.980      3838.           3917.


Example

Now let’s practice using %>% to do some analysis.

We will analyze the data set babies in the package openintro. Use the help documentation to understand the meaning of the variables.


Question 1: Histogram of mother’s age

This can be done by data visualization alone without any transformation.

ggplot(babies) + geom_histogram(aes(age), binwidth = 1, boundary = 20.5) + 
  labs(title = "Analysis of 'babies' Data Set", x = "Mother's age when giving birth") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))


Question 2: Is there any correlation between whether the mother smokes and the gestation length?

We can create a box plot to study this question. Again this only involves data visualization.

ggplot(babies) + 
  geom_boxplot(mapping = aes(x = as.factor(smoke), y = gestation))

This figure does not look very informative, so we may consider summarizing the data instead of creating a graph. Note that we will create a new categorical variable with labels “non-smoker”, “smoker”, and NA to replace the values 0, 1, and NA in the smoke variable.

smoke_gest <- babies %>%
  mutate(smoke_label = fct_recode(as.factor(smoke), "non-smoker" = "0", "smoker" = "1")) %>%
  group_by(smoke_label) %>%
  summarise(n = n(), mean_gest = mean(gestation, na.rm = TRUE), median_gest = median(gestation, na.rm = TRUE)) %>%
  print()
## # A tibble: 3 × 4
##   smoke_label     n mean_gest median_gest
##   <fct>       <int>     <dbl>       <dbl>
## 1 non-smoker    742      280.        281 
## 2 smoker        484      278.        279 
## 3 <NA>           10      282.        280.

As we see, there is little difference in these measures, so whether the mother smokes appears to have little association with gestation length.


Lab Exercise

1. With the babies data set, investigate whether there is a correlation between mother’s weight and the gestation length.

2. Investigate all variables describing the mother in the data set to see which ones may have a non-negligible correlation with gestation length.

3. Investigate all variables in the data set to see which ones may have a non-negligible correlation with the birthweight of babies.


3. Introduction to EDA (Exploratory Data Analysis)

The last example introduces the concept of EDA, which stands for “Exploratory Data Analysis”. This is the goal of our course: teaching you how to do data exploration.

Going back to the diagram from the first lecture, we see that EDA is an iterative cycle (the part in the grey box). You:

  • Generate questions about your data.

  • Search for answers by visualising, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind.

During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.

The term data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

In the next lecture, we will practice doing more EDA.