library(nycflights13)
library(tidyverse)
library(openintro)
In this lecture, we will continue to study summary functions for summarise(), then learn how to use the pipe operator %>% to make the code for a sequence of operations much easier to write and clearer to read.
R provides many other summary functions for use with summarise(). Below are some of the most useful ones.
mean(), median()
We have used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it and 50% is below it.
It is sometimes useful to combine aggregation with logical subsetting. For example:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
  avg_delay1 = mean(arr_delay),                 # the average delay
  avg_delay2 = mean(arr_delay[arr_delay > 0])   # the average positive delay
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # … with 355 more rows
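Since the mean is sensitive to extreme values while the median is not, a quick comparison makes the difference concrete (the numbers below are made up purely for illustration):
x <- c(1, 2, 3, 100)
mean(x)    # 26.5 -- pulled up by the single large value
median(x)  # 2.5 -- half of the values lie below it, half above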
sum() and n()
The sum() function is commonly used to compute relative frequencies, in combination with the counting function n().
my_summary <- summarise(group_by(flights, origin), n = n()) # n() counts how many rows there are for each group
my_summary
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
my_summary <- mutate(my_summary, prop = n/sum(n)) # Use `mutate()` to add a column of "proportion"
my_summary
## # A tibble: 3 × 3
## origin n prop
## <chr> <int> <dbl>
## 1 EWR 120835 0.359
## 2 JFK 111279 0.330
## 3 LGA 104662 0.311
ggplot(my_summary) + geom_col(mapping = aes(x = origin, y = prop)) +
labs(title = "Total Number of Flights from NYC in 2013", x = "Airport Code", y = "Relative Frequency") +
scale_y_continuous(labels = scales::percent) +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
sd(x), IQR(x), mad(x)
The root mean squared deviation, or standard deviation sd(x) for short, is the standard measure of spread. The interquartile range IQR(x) and the median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
For example, let's find the destinations with the highest variation in arrival delay:
not_canceled <- filter(flights, !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_summary <- summarise(my_data, sd = sd(arr_delay))
arrange(my_summary, desc(sd))
## # A tibble: 104 × 2
## dest sd
## <chr> <dbl>
## 1 HNL 60.8
## 2 TUL 60.3
## 3 TVC 59.3
## 4 CAK 58.3
## 5 TYS 58.0
## 6 SAT 57.8
## 7 DSM 56.9
## 8 MSN 56.8
## 9 BHM 56.2
## 10 CVG 55.1
## # … with 94 more rows
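To see why IQR() and mad() are called robust, consider a small made-up vector with one extreme outlier: it inflates the standard deviation but barely moves the other two measures.
x <- c(10, 11, 12, 12, 13, 500)
sd(x)   # about 199 -- dominated by the outlier
IQR(x)  # 1.5 -- based on quartiles, barely affected
mad(x)  # about 1.48 -- based on the median, barely affected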
min(x), quantile(x, 0.25), max(x)
Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values and less than the remaining 75%.
For example, if we want to know when the first and last flights leave on each day, we may do the following:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
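The example above uses min() and max(); as a minimal sketch, quantile() can be applied to the same grouped data (output omitted):
summarise(my_data,
  q25 = quantile(dep_time, 0.25),  # a quarter of departures leave before this time
  q75 = quantile(dep_time, 0.75)   # three quarters leave before this time
)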
first(x), nth(x, 2), last(x)
These work similarly to x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
summarise(my_data,
  first_dep = first(dep_time),
  last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
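The optional default argument of nth() controls what is returned when a group has fewer elements than the requested position. A minimal sketch on the same grouped data (output omitted):
summarise(my_data,
  second_dep = nth(dep_time, 2),                        # like dep_time[2]
  tenth_dep = nth(dep_time, 10, default = NA_integer_)  # NA if a day has fewer than 10 flights
)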
n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
Counting is used so frequently in data analysis that there is a separate function count() to do the job. count(<data_name>, <variable_names>) is equivalent to first grouping by those variables and then counting the number of rows in each group:
count(flights, origin)
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
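As a further sketch, sum(!is.na(x)) from above counts non-missing values; for instance, the number of flights per day with a recorded tail number (output omitted):
summarise(group_by(flights, year, month, day),
  n = n(),                          # all flights that day
  n_tailnum = sum(!is.na(tailnum))  # flights with a non-missing tail number
)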
Similarly, we can use n_distinct() to find which destinations are served by the most carriers:
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
%>%
In most of the examples so far, we have needed a sequence of operations on the data set to accomplish a task. For example:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
As we have seen, we must assign the transformed data to a new or existing object at every step to save the intermediate results. When there are many steps, this becomes cumbersome and inconvenient, and the code is less clear for readers.
Instead, we can use the pipe operator %>%, introduced in the package magrittr, which is automatically loaded with tidyverse, so we don't have to load it explicitly.
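Mechanically, x %>% f(y) is equivalent to f(x, y): the pipe feeds the left-hand side in as the first argument of the function on its right. A tiny illustration:
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9))), which is 6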
The code for the previous example, rewritten with the pipe, looks like this:
my_data <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers)) %>%
print()
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
So the template for the pipe is:
<new_data_name> <- <original_data_name> %>%
operation1() %>%
operation2() %>%
...
Most importantly, in every operation function (basically all functions we have introduced so far from dplyr) we can now omit the data set name as the first argument and directly supply the operations. For example, the diamonds analysis below is written entirely with the pipe:
my_summary <- diamonds %>%
  filter(carat < 2) %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat), mean_price = mean(price)) %>%
  mutate(mean_unit_price = mean_price/mean_carat) %>%
  print()
## # A tibble: 7 × 4
## color mean_carat mean_price mean_unit_price
## <ord> <dbl> <dbl> <dbl>
## 1 D 0.646 3063. 4743.
## 2 E 0.643 2944. 4577.
## 3 F 0.716 3547. 4952.
## 4 G 0.737 3720. 5047.
## 5 H 0.828 3765. 4548.
## 6 I 0.884 3814. 4314.
## 7 J 0.980 3838. 3917.
Now let's practice using %>% to do some analysis. We will analyze the data set babies in the package openintro. Use the help documents to understand the meaning of the variables.
First, let's look at the distribution of the mother's age when giving birth. This can be done by data visualization alone, without any transformation:
ggplot(babies) + geom_histogram(aes(age), binwidth = 1, boundary = 20.5) +
labs(title = "Analysis of 'babies' Data Set", x = "Mother's age when giving birth") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Next, does the mother's smoking status affect the length of gestation? We can create a box plot to study this question. Again, this only involves data visualization:
ggplot(babies) +
geom_boxplot(mapping = aes(x = as.factor(smoke), y = gestation))
This figure does not look very nice, so we may consider summarizing the data instead of creating a graph. Note that we will create a new categorical variable with the labels "non-smoker", "smoker" and "NA" to replace "0", "1" and "NA" in the smoke variable.
smoke_gest <- babies %>%
mutate(smoke_label = fct_recode(as.factor(smoke), "non-smoker" = "0", "smoker" = "1")) %>%
group_by(smoke_label) %>%
summarise(n = n(), mean_gest = mean(gestation, na.rm = TRUE), median_gest = median(gestation, na.rm = TRUE)) %>%
print()
## # A tibble: 3 × 4
## smoke_label n mean_gest median_gest
## <fct> <int> <dbl> <dbl>
## 1 non-smoker 742 280. 281
## 2 smoker 484 278. 279
## 3 <NA> 10 282. 280.
As we see, there is little difference in these measures, and it seems that smoking does not matter much to the gestation length.
Exercise: using the babies data set, investigate whether there is a correlation between the mother's weight and the gestation length.
The last example introduces the concept of EDA, which stands for "Exploratory Data Analysis". This is the goal of our course: teaching you how to do data exploration.
Going back to the diagram from the first lecture, we see that EDA is an iterative cycle (the part in the grey box). You:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind.
During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
The term data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
In the next lecture, we will practice doing more EDA.