We will be working with data from the nycflights13 package, and use ggplot2 to help us understand the data. So we will need to load the library for nycflights13 and tidyverse.
There will be a conflict error message when you load tidyverse after dplyr. Dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you will need to use thier full names: stats::filter() and stats::lag().
Tibbles are data frames, but slightly tweaked to work better in the tidyverse. Each column has different abbreviation under them:
All verbs work similarly. 1. The first argument is a data frame 2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes) 3. The sesult is a new data frame.
The underlying data is never modified. To save use assignment operator [ <- ]
In propositional logic and boolean algebra, De Morgan’s laws[1][2][3] are a pair of transformation rules that are both valid rules of inference. They are named after Augustus De Morgan, a 19th-century British mathematician. The rules allow the expression of conjunctions and disjunctions purely in terms of each other via negation.
The rules can be expressed in English as: the negation of a conjunction is the disjunction of the negations; and the negation of a disjunction is the conjunction of the negations;
or the complement of the union of two sets is the same as the intersection of their complements; and the complement of the intersection of two sets is the same as the union of their complements.
For example, if you were to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of hte following two filters:
Multiple arguments are equivalent to and
filter(mtcars, cyl < 6, vs == 1) There are also && and || but it will taught later on under “Conditional execution”
Missing Values
Missing values or “NAs” (not available) represent an unknown value so missing values are “contagious”. Almost any operation involving an unknown value will also be unkown.
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
# but this one is confusing
NA == NA
[1] NA
To test of NA, use is.na()
x <- NA
y <- NA
is.na(x)
[1] TRUE
Filter() only includes rows where the condition is TRUE. It excludes both FALSE and NA. So if you want to preserve missing values, ask for them explicitly
In this case, the NA is not listed.
df <- tibble(x=c(1,NA,3))
filter(df,x>1)
In this case, the NA is listed.
filter(df, is.na(x) | x>1)
Exercises 1. find all flights that: a. had an arrival delay of two hours or more:
filter(flights, arr_delay>120)
- flew to Houston (IAH or HOU)
filter(flights, dest %in% c("IAH","HOU"))
- Were operated by United, American or Delta
filter(flights, carrier %in% c("UA","AA","DL"))
- Departed in summer
filter(flights, month>6 & month<10)
- Arrived more than 2 hours late, but didn’t leave late
filter(flights, arr_delay>120, dep_delay<2)
- Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, arr_delay<=90, dep_delay>120)
- Departed between midnight and 6am inclusive
filter(flights, dep_time>=000, dep_time<=600)
Use between() to simplify the code:
filter(flights, between(dep_time,000,600))
- How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
# list missing dep_time
filter(flights, is.na(dep_time))
Answer: dep_delay, arr_time, arr_delay are all missing. This might represent cancelled flights.
Now we can also sort in descending order using the Arrange() verb.
arrange(filter(flights, is.na(dep_time)),desc(month))
Arrange Rows with arrange()
Arrange(data, arg1, arg2, arg3…) If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. You can also use desc(arg) to reorder a column in descending order.
arrange(flights, year, month, day)
arrange(flights, year, desc(month), day)
Missing values are always sorted at the end. To sort missing values at the start, use is.na()
arrange(flights, dep_time,is.na(dep_time))
Note: Couldnt get NA to show up in the view.
Select columns with select()
Select() allows you to zoom in on a useful subset using operations based on the names of the variables.
# show only year, month and day
select(flights, year, month, day)
So notice that there are only 3 columns now.
# select all columns between year and day inclusive
select(flights, year:day)
# select all columns EXCEPT those from year to day
select(flights, -(year:day))
Helper functions within select()
starts_with(“abc”)
ends_with(“abc”)
contains(“ijk”)
matches(“(.)\1”)
num_range(“x”,1:3) matches x1,x2,x3
see ?select for more details.
Rename
While you can use select() to rename variables, it is not useful because it drops all of the variables not explicitly mentioned. For this, use the rename() that keeps all the variables that aren’t explicitly mentioned.
rename(flights, DepartureTime = dep_time)
What happens when we use a variable mulitple times in a select call?
select(flights, origin, dest, air_time, origin)
select(flights, contains("TIME"))
Add new variables with Mutate()
Use the mutate function when you want to add new variables that are a function of existing columns. Mutate always adds new variables to the end of the dataset.
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain=arr_delay -dep_delay, speed= distance /air_time * 60)
Note: gain and speed are the new columns generated by mutate() If you only want to keep the newly generated variables use transmut()
transmute(flights, gain=arr_delay -dep_delay, speed= distance /air_time * 60)
Useful Creation functions
Modular arithemetic (%/% and %%)
Integer Division (%/%) and remainder (%%) You can use this in the NYC Flightst
library(tidyverse)
library(nycflights13)
transmute(flights, dep_time, hour=dep_time %/% 100, minute=dep_time %% 100)
Logs log(),log2() , log10()
Logarithms are an incredibly useful transformation for dealling with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive
h3> Offsets
lead() and Lag() allowyou to refer to leading or lagging values. This can be used to compute running differences x-lag(x) or find when values change (x!=lag(x)). They are most useful in conjunction with group_by()
x<-1:10
# display x values
x
[1] 1 2 3 4 5 6 7 8 9 10
# display lag(x)
lag(x)
[1] NA 1 2 3 4 5 6 7 8 9
# display lead(x)
lead(x)
[1] 2 3 4 5 6 7 8 9 10 NA
Cummulative and rolling aggregates
Functions for running sums, products, mins, and maxes: cumsum(), cumprod(), cummin(), cummax() and dplyr provides cummean() for cummulative means. Try also the RcppRoll package
# print x again
x
[1] 1 2 3 4 5 6 7 8 9 10
# what is the cummulative sum of x?
cumsum(x)
[1] 1 3 6 10 15 21 28 36 45 55
# what is the cummulative mean of x?
cummean(x)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Ranking
Try min_rank() smallest values= smallest ranks
# set up variable
y <- c(1,2,2,NA,3,4)
# display Y values
y
[1] 1 2 2 NA 3 4
# display rank
min_rank(y)
[1] 1 2 2 NA 4 5
min_rank(desc(y))
[1] 5 3 3 NA 2 1
If rank() doesnt do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), and ntile().
y
[1] 1 2 2 NA 3 4
row_number(y)
[1] 1 2 3 NA 4 5
dense_rank(y)
[1] 1 2 2 NA 3 4
cume_dist(y)
[1] 0.2 0.6 0.6 NA 0.8 1.0
percent_rank(y)
[1] 0.00 0.25 0.25 NA 0.75 1.00
Grouped Summaries with Summarize()
summarize() collapses a data frame to a single row.
library(nycflights13)
summarize(flights, delay=mean(dep_delay, na.rm=TRUE))
Enhance summarize() with group_by to get summaries by group of data.
library(nycflights13)
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay=mean(dep_delay, na.rm=TRUE))
Then we can combine multiple operations with the pipe Suppose we want to find the relationship between the distance and average delay for each location.
# Old method without using Pipes
library(nycflights13)
by_dest <- group_by(flights,dest)
delay <- summarize(by_dest, count=n(), dist=mean(distance, na.rm=TRUE), delay=mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count>20, dest !="HNL")
delay
# let us plot the values to see the correlation
ggplot(data=delay, mapping=aes(x=dist, y=delay))+
geom_point(aes(size=count), aplpha=1/3)+
geom_smooth(se=FALSE)
Ignoring unknown parameters: aplpha

Now, let us see how Pipe (%>%) will make the code more efficient:
delays <- flights %>%
group_by(dest) %>%
summarize(count=n(), dist=mean(distance, na.rm=TRUE), delay=mean(arr_delay, na.rm=TRUE))%>%
filter(count>20, dest !="HNL")
ggplot(data=delay, mapping=aes(x=dist, y=delay))+
geom_point(aes(size=count), alpha=1/3)+
geom_smooth(se=FALSE)

A good way to think of Pipe (%>%) is by reading it as “then”. Behind the scenes, x %>% f(y) means f(x,y) x %>% f(y) %>% g(z) turns into g(f(x,y),z) Read left to right, top to bottom.
Missing Values
na.rm argument is use to take care of missing values.
# see how many missing values there are
flights %>% group_by(year,month,day) %>%
summarize(mean=mean(dep_delay))
Fortunately, all aggregate functions have the na.rm argument (NA remove) prior to computation
flights %>% group_by(year,month,day) %>%
summarize(mean=mean(dep_delay, na.rm =TRUE))
Taking care of cancelled flights
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year,month, day) %>%
summarize(mean=mean(dep_delay))
# no need to use na.rm since all missing values have been filter out already
Counts
Its a good idea to include either a count n() or a count of nonmissing values sum(!is.na(x)) so that you know the sample size.
For example this below shows an average delay of 5 hours!
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(delay=mean(arr_delay))
ggplot(data=delays, mapping=aes(x=delay))+
geom_freqpoly(binwidth=10)

We can get insight if we draw a scatterplot of number of lights versus average delay:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(delay=mean(arr_delay, na.rm=TRUE), n= n())
ggplot(data=delays, mapping=aes(x=n, y=delay))+
geom_point(alpha=1/10)

There are wider variations when there are lesser flights than when there are a lot of flights. So get more details, let us filter out the wild patterns when n>25
delays %>%
filter(n>25) %>%
ggplot(mapping=aes(x=n, y=delay))+
geom_point(alpha=1/10)

Look at avverage performance of baseball batters to numbe of times they’re at a bat. Using Lahman package to compute the batting average (number of hits/number of attempts) of every majore league baseball player.
# you need to install the Lahman package first for this to work
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarize(
ba=sum(H, na.rm=TRUE)/sum(AB, na.rm=TRUE),
ab=sum(AB, na.rm=TRUE)
)
batters %>%
filter(ab >100) %>%
ggplot(mapping=aes(x=ab, y=ba))+
geom_point()+
geom_smooth(se=FALSE)

NA
The implication, if you naively sort on desc(ba) the pople with the best batting averages are clearly lucky…not skilled.
batters %>%
arrange(desc(ba))
Useful Summary Functions
Measures of location
Median is a value where 50% of x is above it, and 50% is below it. Try combining aggregations with subsetting
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
avg_delay1=mean(arr_delay),
avg_delay2=mean(arr_delay[arr_delay>0])
)
measures of spread
Mean squared deviation, or standard deviation or SD for short is the standard measure of spread. The interquartile range IQR() and median absoute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
# why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd=sd(distance))%>%
arrange(desc(distance_sd))
Measures of rank min(x), quantile(x,0.25), max(x)
Qunatiles are a generalization of the median. Quantile(x,0.25) will find a value of x that is greater than 25% of the values and less than the remaining 75%
# when do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month,day) %>%
summarize(first=min(dep_time), last=max(dep_time))
Measures of poistion first(x), nth(x,2) last(x)
# find the first and last departure for each day
not_cancelled %>%
group_by(year, month,day) %>%
summarize(first_dep=first(dep_time), last_dep=last(dep_time))
These functions are complementary to filtering on ranks
not_cancelled %>%
group_by(year, month,day) %>%
mutate(r=min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
To count the number of distinct values
not_cancelled %>%
group_by(dest) %>%
summarize(carriers=n_distinct(carrier)) %>%
arrange(desc(carriers))
You can use a weight variable. You can use this to count the total number of miles a plane flew.
not_cancelled %>%
count(tailnum,wt=distance)
Counts and proportions of logical values
When you want to count the number of trues (1) and false(0)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early =sum(dep_time<500))
Grouping Multiple Variables
When you group by multiple variables , each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset.
daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights=n()))
(per_month <- summarize(per_day, flights=sum(flights)))
# per year
(per_year <- summarize(per_month, flights=sum(flights)))
To remove groupings, use Ungroup()
daily %>%
ungroup() %>%
summarize(flights=n())
Grouped Mutates and Filters
Grouping is most useful in conjunction with summarize(), but you can also do convenient operations with mutate() and filter()
# find the worst members of each group
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay))<10)
# find all groups bigger than a threshold
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() >365)
# display
popular_dests
# standardie to compute per group metrics
popular_dests <- flights %>%
filter(arr_delay>0) %>%
mutate(prop_delay=arr_delay /sum(arr_delay)) %>%
select(year:day,dest,arr_delay, prop_delay)
# display
popular_dests
---
title: "R for Data ScienceChapter 3"
output: html_notebook
---

<h1> Chapter 3 Data Transformation with dplyr </h1>
You will learn how to transform your data with dplyr package.
<h2> Pre requisites </h2>
We will be working with data from the nycflights13 package, and use ggplot2 to help us understand the data. So we will need to load the library for nycflights13 and tidyverse.</p>

There will be a conflict error message when you load tidyverse after dplyr. Dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you will need to use thier full names: stats::filter() and stats::lag(). </p>

<h2> Explore Nycflights data </h2>

```{r}
# you may need to install.packages("nycflights13")

library(nycflights13)
library(tidyverse)

# flights contain 336,776 records of flights that departed from NY in 2013
?flights

# to see the whole dataset use the function view()
View(flights)
# it prints as a tibble
```

Tibbles are data frames, but slightly tweaked to work better in the tidyverse. 
Each column has different abbreviation under them: </br>
<li>int - integer </li>
<li>dbl - doubles or real numbers </li>
<li>chr - character vector or strings </li>
<li>dttm - date time</li>
<li>lgl - logical </li>
<li>fctr - factors used by R to represent categorical values </li>

<h2> Dplyr Basisc </h2>

Five Key Dplr functions  for data manipulation: </br>
<li> filter() to filter rows or data points</li>
<li> arrange() to sort reorder the data</li>
<li> mutate() to create new variables with functions of existing variables</li>
<li> summarize() to collapse many values down to a single summary </li>
<li> group_by() changes the scope of each function from operating on the entire dataset to operating on a group by group basis</li> </p>

All verbs work similarly. </br>
1. The first argument is a data frame</br>
2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)</br>
3. The sesult is a new data frame. </p>

```{r}
# filter rows

filter(flights, month==1, day ==1)

```

The underlying data is never modified. To save use assignment operator [ <- ]

<h2> Comparisons </h2>
use the == for equal, >,>=, <,<=!=
</br>


<h2> Logical operators </h2>
<li>& for AND </li>
<li>| for OR </li>
<li>! for NOT </li>
</p>

The following code will find all flights that departed in November or December

```{r}
filter(flights, month==11 | month==12)
```

or use the x %in% y approach

```{r}
filter(flights, month %in% c(11,12))
```

See also https://en.wikipedia.org/wiki/De_Morgan%27s_laws </br>
In propositional logic and boolean algebra, De Morgan's laws[1][2][3] are a pair of transformation rules that are both valid rules of inference. They are named after Augustus De Morgan, a 19th-century British mathematician. The rules allow the expression of conjunctions and disjunctions purely in terms of each other via negation.</p>

The rules can be expressed in English as:</br>
the negation of a conjunction is the disjunction of the negations; and</br>
the negation of a disjunction is the conjunction of the negations;</p>
or</br>
the complement of the union of two sets is the same as the intersection of their complements; and</br>
the complement of the intersection of two sets is the same as the union of their complements.</br>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Demorganlaws.svg/330px-Demorganlaws.svg.png" >

For example, if you were to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of hte following two filters:

```{r}
filter(flights, !(arr_delay>120 | dep_delay>120))
```

or

```{r}
filter(flights, arr_delay<=120, dep_delay<=120)
```
Examples </br>

filter(mtcars, cyl == 8)</br>
filter(mtcars, cyl < 6)</br>

# Multiple criteria</br>
filter(mtcars, cyl < 6 & vs == 1)</br>
filter(mtcars, cyl < 6 | vs == 1)</br>

# Multiple arguments are equivalent to and</br>
filter(mtcars, cyl < 6, vs == 1)</br>
There are also && and || but it will taught later on under "Conditional execution"
</p>


<h2> Missing Values </h2>
Missing values or "NAs" (not available) represent an unknown value so missing values are "contagious". Almost any operation involving an unknown value will also be unkown.

```{r}
NA > 5
10 == NA
NA + 10

# but this one is confusing
NA == NA


```

To test of NA, use is.na()

```{r}
x <-  NA
y <-  NA

is.na(x)


```

Filter() only includes rows where the condition is TRUE. It excludes both FALSE and NA. So if you want to preserve missing values, ask for them explicitly

In this case, the NA is not listed.

```{r}
df <- tibble(x=c(1,NA,3))
filter(df,x>1)
```


In this case, the NA is listed.

```{r}
filter(df, is.na(x) | x>1)
```

Exercises </br>
1. find all flights that:
a. had an arrival delay of two hours or more:

```{r}

filter(flights, arr_delay>120)

```

b. flew to Houston (IAH or HOU)

```{r}

filter(flights, dest %in% c("IAH","HOU"))

```

c. Were operated by United, American or Delta
```{r}
filter(flights, carrier %in% c("UA","AA","DL"))
```

d. Departed in summer

```{r}
filter(flights, month>6 & month<10)
```

e. Arrived more than 2 hours late, but didn't leave late

```{r}
filter(flights, arr_delay>120, dep_delay<2)
```

f. Were delayed by at least an hour, but made up over 30 minutes in flight

```{r}
filter(flights, arr_delay<=90, dep_delay>120)
```

g. Departed between midnight and 6am inclusive
```{r}
filter(flights, dep_time>=000, dep_time<=600)
```

Use <b>between()</b> to simplify the code:
```{r}
filter(flights, between(dep_time,000,600))
```

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

```{r}
# list missing dep_time
filter(flights, is.na(dep_time))
```

Answer: dep_delay, arr_time, arr_delay are all missing. This might represent cancelled flights.

Now we can also sort in descending order using the Arrange() verb.

```{r}
arrange(filter(flights, is.na(dep_time)),desc(month))
```


<h2> Arrange Rows with arrange() </h2>
Arrange(data, arg1, arg2, arg3...)
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. You can also use desc(arg) to reorder a column in descending order.

```{r}
arrange(flights, year, month, day)
```

```{r}
arrange(flights, year, desc(month), day)
```
Missing values are always sorted at the end. 
To sort missing values at the start, use is.na()

```{r}
arrange(flights, dep_time,is.na(dep_time))
```

Note: Couldnt get NA to show up in the view.
<h2> Select columns with select() </h2>
Select() allows you to zoom in on a useful subset using operations based on the names of the variables.

```{r}
# show only year, month and day
select(flights, year, month, day)
```

So notice that there are only 3 columns now. 

```{r}
# select all columns between year and day inclusive
select(flights, year:day)
```

```{r}
# select all columns EXCEPT those from year to day 
select(flights, -(year:day))
```

<h3> Helper functions within select() </h3>
<li>starts_with("abc")</li>
<li>ends_with("abc")</li>
<li>contains("ijk") </li>
<li>matches("(.)\\1") </li>
<li>num_range("x",1:3) matches x1,x2,x3  </li> </p>

see ?select for more details. 

<h3> Rename </h3>
While you can use select() to rename variables, it is not useful because it drops all of the variables not explicitly mentioned. For this, use the rename() that keeps all the variables that aren't explicitly mentioned.

```{r}
rename(flights, DepartureTime = dep_time)
```



What happens when we use a variable mulitple times in a select call?
```{r}
select(flights, origin, dest, air_time, origin)
```

```{r}
select(flights, contains("TIME"))
```

<h2> Add new variables with Mutate() </h2>
Use the mutate function when you want to add new variables that are a function of existing columns. </br>
Mutate always adds new variables to the end of the dataset.

```{r}
flights_sml <-  select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain=arr_delay -dep_delay, speed= distance /air_time * 60)
```

Note: gain and speed are the new columns generated by mutate()
</br>
If you only want to keep the newly generated variables use transmut()

```{r}
transmute(flights, gain=arr_delay -dep_delay, speed= distance /air_time * 60)
```


<h2> Useful Creation functions </h2>

<h3>Modular arithemetic (%/% and %%) </h3>
Integer Division (%/%) and remainder (%%) You can use this in the NYC Flightst 

```{r}
# convert dep_time ito hours and minutes
library(tidyverse)
library(nycflights13)
transmute(flights, dep_time, hour=dep_time %/% 100, minute=dep_time %% 100)

```
<h3> Logs log(),log2() , log10()</h3>
Logarithms are an incredibly useful transformation for dealling with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive </P>

>h3> Offsets </h3>
lead() and Lag() allowyou to refer to leading or lagging values. This can be used to compute running differences  x-lag(x) or find when values change (x!=lag(x)).  They are most useful in conjunction with group_by() 
</p>

```{r}
x<-1:10
# display x values
x
# display lag(x)
lag(x)
# display lead(x)
lead(x)

```

<h3> Cummulative and rolling aggregates </h3>
Functions for running sums, products, mins, and maxes: cumsum(), cumprod(), cummin(), cummax() and dplyr provides cummean() for cummulative means. Try also the RcppRoll package

```{r}
# print x again
x
# what is the cummulative sum of x?
cumsum(x)
# what is the cummulative mean of x?
cummean(x)
```


<h3> Ranking</h3>
Try min_rank() smallest values= smallest ranks

```{r}

# set up variable
y <- c(1,2,2,NA,3,4)
# display Y values
y
# display rank
min_rank(y)
min_rank(desc(y))

```


If rank() doesnt do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), and ntile().

```{r}
y 
row_number(y)
dense_rank(y)
cume_dist(y)
percent_rank(y)
```

<h2> Grouped Summaries with Summarize() </h2>
summarize() collapses a data frame to a single row. 

```{r}
library(nycflights13)
summarize(flights, delay=mean(dep_delay, na.rm=TRUE))
```


Enhance summarize() with group_by to get summaries by group of data.

```{r}
library(nycflights13)
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay=mean(dep_delay, na.rm=TRUE))
```

Then we can combine multiple operations with the pipe </br>
Suppose we want to find the relationship between the distance and average delay for each location. 

```{r}
# Old method without using Pipes
library(nycflights13)
by_dest <- group_by(flights,dest)
delay <- summarize(by_dest, count=n(), dist=mean(distance, na.rm=TRUE), delay=mean(arr_delay, na.rm = TRUE))
# filter out HNL honolulu as it is twice as far away as the next closest airport
delay <- filter(delay, count>20, dest !="HNL")
delay
```

```{r}
# let us plot the values to see the correlation
ggplot(data=delay, mapping=aes(x=dist, y=delay))+
  geom_point(aes(size=count), alpha=1/3)+
  geom_smooth(se=FALSE)

```

Now, let us see how Pipe <b>(%>%)</b> will make the code more efficient:

```{r}
delays <- flights %>% 
  group_by(dest) %>% 
  summarize(count=n(),  dist=mean(distance, na.rm=TRUE),  delay=mean(arr_delay, na.rm=TRUE))%>% 
  filter(count>20, dest !="HNL")

ggplot(data=delay, mapping=aes(x=dist, y=delay))+
  geom_point(aes(size=count), alpha=1/3)+
  geom_smooth(se=FALSE)
```

A good way to think of Pipe <b>(%>%)</b> is by reading it as "then".
Behind the scenes, x %>% f(y) means f(x,y) </br>
x %>% f(y) %>% g(z) turns into g(f(x,y),z) </br>
Read left to right, top to bottom. 

<h2> Missing Values </h2>
na.rm argument is use to take care of missing values. 

```{r}
# see how many missing values there are
flights %>% group_by(year,month,day) %>%
  summarize(mean=mean(dep_delay))

```


Fortunately, all aggregate functions have the na.rm argument (NA remove) prior to computation

```{r}
flights %>% group_by(year,month,day) %>%
  summarize(mean=mean(dep_delay, na.rm =TRUE))
```

Taking care of cancelled flights

```{r}
not_cancelled <-  flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(year,month, day) %>%
  summarize(mean=mean(dep_delay))
# no need to use na.rm since all missing values have been filter out already
```

<h2> Counts </h2>
Its a good idea to include either a count n() or a count of nonmissing values sum(!is.na(x)) so that you know the sample size. 

For example this below shows an average delay of 5 hours!

```{r}
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(delay=mean(arr_delay))

ggplot(data=delays, mapping=aes(x=delay))+
  geom_freqpoly(binwidth=10)
```

We can get insight if we draw a scatterplot of number of lights versus average delay:

```{r}
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(delay=mean(arr_delay, na.rm=TRUE), n= n())

ggplot(data=delays, mapping=aes(x=n, y=delay))+
  geom_point(alpha=1/10)
```
There are wider variations when there are lesser flights than when there are a lot of flights.
So get more details, let us filter out the wild patterns when n>25

```{r}
delays %>%
  filter(n>25) %>%
  ggplot(mapping=aes(x=n, y=delay))+
  geom_point(alpha=1/10)

```

Look at avverage performance of baseball batters to numbe of times they're at a bat. Using Lahman package to compute the batting average (number of hits/number of attempts) of every majore league baseball player.

```{r}
# you need to install the Lahman package first for this to work 
batting <- as_tibble(Lahman::Batting)

batters <-  batting %>%
  group_by(playerID) %>%
  summarize(
    ba=sum(H, na.rm=TRUE)/sum(AB, na.rm=TRUE),
    ab=sum(AB, na.rm=TRUE)
  )

batters %>%
  filter(ab >100) %>%
  ggplot(mapping=aes(x=ab, y=ba))+
  geom_point()+
  geom_smooth(se=FALSE)

    
```
The implication, if you naively sort on desc(ba) the pople with the best batting averages are clearly lucky...not skilled.


```{r}
batters %>%
  arrange(desc(ba))
```

<h2> Useful Summary Functions </h2>






<h3> Measures of location  </h3>
Median is a value where 50% of x is above it, and 50% is below it. 
Try combining aggregations with subsetting

```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1=mean(arr_delay),
    avg_delay2=mean(arr_delay[arr_delay>0])
  )
```

<h3> measures of spread </h3>
Mean squared deviation, or standard deviation or SD for short is the standard measure of spread. The interquartile range IQR() and median absoute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

```{r}
# why is distance to some destinations more variable than to others?
not_cancelled %>%
  group_by(dest) %>%
  summarise(distance_sd=sd(distance))%>%
  arrange(desc(distance_sd))
```

<h3> Measures of rank min(x), quantile(x,0.25), max(x) </h3>
Qunatiles are a generalization of the median. Quantile(x,0.25) will find a value of x that is greater than 25% of the values and less than the remaining 75%

```{r}
# when do the first and last flights leave each day?
not_cancelled %>%
  group_by(year, month,day) %>%
  summarize(first=min(dep_time), last=max(dep_time))
```

<h3> Measures of poistion first(x), nth(x,2) last(x) </h3>
```{r}
# find the first and last departure for each day
not_cancelled %>%
  group_by(year, month,day) %>%
  summarize(first_dep=first(dep_time), last_dep=last(dep_time))
```

These functions are complementary to filtering on ranks 

```{r}
not_cancelled %>%
  group_by(year, month,day) %>%
  mutate(r=min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))

```

To count the number of distinct values

```{r}
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers=n_distinct(carrier)) %>%
  arrange(desc(carriers))
```

You can use a weight variable. You can use this to count the total number of miles a plane flew.

```{r}
not_cancelled %>%
  count(tailnum,wt=distance)
```

<h3> Counts and proportions of logical values </h3>
When you want to count the number of trues (1) and false(0)

```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early =sum(dep_time<500))
```

<h3> Grouping Multiple Variables </h3>
When you group by multiple variables , each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. 

```{r}
daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights=n()))
```
 
```{r}
(per_month <-  summarize(per_day, flights=sum(flights)))
```

```{r}
# per year
(per_year <- summarize(per_month, flights=sum(flights)))
```

To remove groupings, use Ungroup()

```{r}
daily %>%
  ungroup() %>%
  summarize(flights=n())
```

<h2> Grouped Mutates and Filters </h2>
Grouping is most useful in conjunction with summarize(), but you can also do convenient operations with mutate() and filter()

```{r}
# find the worst members of each group
flights_sml %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay))<10)
```

```{r}
# find all groups bigger than a threshold
popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() >365)
# display
popular_dests

```

```{r}
# standardie to compute per group metrics
popular_dests <- flights %>%
   filter(arr_delay>0) %>%
  mutate(prop_delay=arr_delay /sum(arr_delay)) %>%
  select(year:day,dest,arr_delay, prop_delay)
# display
popular_dests
```

