Here is a short document to show how easy it is to shred data using tools other than Excel. For this example, we will use the nycflights13 dataset. The first thing to do is to bring in the tools and data.
library(dplyr); library(magrittr); library(ggplot2); library(knitr)
flights <- nycflights13::flights
Of these libraries, the most important for this exercise is dplyr, which provides consistent, stable ‘verbs’ for data manipulation. Below we will use filter, group_by, summarize, rename, and select. The main advantage of dplyr is that it provides backends that work well with a number of storage constructs. In short: once I have the data imported, it behaves the same regardless of the storage construct. This is no small thing!
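As a hedged illustration of that claim (an aside, assuming the DBI, RSQLite, and dbplyr packages are installed), the very same verbs run unchanged against a database table:

# write flights into a throwaway SQLite database and query it with dplyr verbs
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "flights", as.data.frame(nycflights13::flights))
tbl(con, "flights") %>%
  filter(carrier == "UA") %>%
  summarize(n = n())        # dbplyr translates this pipeline to SQL for us
DBI::dbDisconnect(con)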
Part and parcel with dplyr is magrittr, which introduces the pipe operator %>%. This allows us to chain operations together and make ‘data sentences’. The general workflow is:
data %>% filter_1 %>% ... %>% filter_n %>% group_by %>% summarize %>% output
Remember: x %>% F() is equivalent to F(x), and x %>% F() %>% G() %>% H() is equivalent to H(G(F(x))).
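A trivial demonstration, using nothing beyond base R and the pipe:

# both lines compute the same value
sqrt(sum(1:10))
1:10 %>% sum() %>% sqrt()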
This is a ‘mid-size’ dataset, with 336776 rows and 19 columns. We can take a ‘peek’ at the data here:
flights %>% head() %>% select(1:5, 10) %>% kable()
year | month | day | dep_time | sched_dep_time | carrier |
---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 515 | UA |
2013 | 1 | 1 | 533 | 529 | UA |
2013 | 1 | 1 | 542 | 540 | AA |
2013 | 1 | 1 | 544 | 545 | B6 |
2013 | 1 | 1 | 554 | 600 | DL |
2013 | 1 | 1 | 554 | 558 | UA |
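Another way to take that peek (an aside; glimpse is exported by dplyr):

glimpse(flights)   # all 19 columns, one per line, with their types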
A major task in data science is subsetting, i.e. finding the ‘one’ observation you are looking for in a ‘haystack’ of data. Clearly, going through all of it by hand is not the right approach.
dplyr makes this easy by applying logical filters in a responsible, memory-efficient way. This is a technical detail that most people don’t really care about; suffice it to say that MS Excel handles memory in a very expensive, sloppy manner.
Suppose that I want to find all flights departing Newark between 6:30 and 8:30 AM from the first through the third of January, on United Airlines, at least five hours long, with less than a one-minute delay. Only show the departure time, flight number, tail number, and destination. Finally, rename the dep_time column to ‘Time’.
flights %>%
  filter(month == 1, day >= 1, day <= 3,
         origin == "EWR", carrier == "UA",
         dep_time > 630, dep_time < 830,
         dep_delay < 1, air_time > 5 * 60) %>%
  select(dep_time, flight, tailnum, dest) %>%
  rename(Time = dep_time) %>%
  kable()
Time | flight | tailnum | dest |
---|---|---|---|
746 | 1668 | N24224 | SFO |
745 | 1668 | N37287 | SFO |
804 | 423 | N528UA | PDX |
822 | 1067 | N73259 | LAX |
701 | 1455 | N13750 | SNA |
704 | 604 | N573UA | LAX |
But wait, there’s more: suppose we want to find which aircraft are used most frequently? Easy:
flights %>%
  filter(!is.na(tailnum)) %>%   # drop rows with no recorded tail number
  group_by(tailnum) %>%
  summarize(count = n()) %>%
  top_n(wt = count, n = 10) %>%
  arrange(desc(count)) %>%
  kable()
tailnum | count |
---|---|
N725MQ | 575 |
N722MQ | 513 |
N723MQ | 507 |
N711MQ | 486 |
N713MQ | 483 |
N258JB | 427 |
N298JB | 407 |
N353JB | 404 |
N351JB | 402 |
N735MQ | 396 |
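An aside for readers on newer versions of dplyr, where top_n has been superseded by slice_max; a sketch of the same query in that style:

flights %>%
  filter(!is.na(tailnum)) %>%
  count(tailnum, name = "count") %>%   # count() is shorthand for group_by() + summarize(n())
  slice_max(count, n = 10)             # sorts descending and keeps the top 10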
Suppose I want to find the anomalies in departure delays. (In the medical field, you might look for outliers in billing, say.)
Step 1: plot the data. If you are using the empirical rule for outliers, i.e. \(\pm 2 \sigma\), you may have problems. That rule rests on a heavy assumption of normality (a bell-shaped curve), and not all data is ‘normal’. While we have both positive and negative delays (i.e. times when the plane departed early), we really only care about the positive ones. Take a look at the graph of delays:
flights %>%
  filter(dep_delay > 0, dep_delay < 200) %>%
  ggplot(aes(x = dep_delay)) +
  geom_histogram() +
  xlab("Delay")
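As a quick sketch of why the \(\pm 2 \sigma\) rule struggles on data like this (an aside, using only base R):

# mean +/- 2*sd on the positive delays; on right-skewed data the
# 'lower' bound can land below zero, although every delay here is positive
pos_delays = flights$dep_delay[flights$dep_delay > 0 & !is.na(flights$dep_delay)]
c(lower = mean(pos_delays) - 2 * sd(pos_delays),
  upper = mean(pos_delays) + 2 * sd(pos_delays))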
Statisticians are more comfortable using the cumulative distribution (for reasons I can explain later). That looks like this:
flights %>%
  filter(dep_delay > 0) %>%
  ggplot(aes(dep_delay)) +
  stat_ecdf(geom = "point") +
  xlab("Delay") +
  ylab("Pr{X<x}")
This is clearly not the Normal distribution. To demonstrate this, we can compare the quantiles of the delay data to the theoretical normal distribution using qqnorm:
flights %>% filter(dep_delay > 0) %$% qqnorm(dep_delay)
Note the use of the exposition (‘explosion’) operator %$%, which lets us address the column dep_delay directly in the graph command. Compare this with the QQ plot of randomly generated normal data:
rnorm(5000) %>% qqnorm()
Instead of doing the normal (pun intended) thing, \(\mu - 1.96 \sigma < X < \mu + 1.96 \sigma\), we can determine the 95% critical values directly from the empirical cdf:
delays = flights$dep_delay
LL = delays[delays > 0] %>% quantile(.025, na.rm = TRUE)
UL = delays[delays > 0] %>% quantile(.975, na.rm = TRUE)
Using this empirical, two-sided construct, we would find delays of less than 1 minute or greater than 193 minutes to be outliers.
Generally, we don’t care if a delay is shorter than average; the variability we care about sits in the upper tail of the distribution. So, let’s find the outliers that way:
UL = delays[delays > 0] %>% quantile(.95, na.rm = TRUE)
flights %>%
  filter(dep_delay > UL) %>%
  group_by(carrier) %>%
  summarize(Count = n(),
            Ave_Delay = mean(dep_delay, na.rm = TRUE),
            Max_Delay = max(dep_delay, na.rm = TRUE)) %>%
  top_n(wt = Count, n = 10) %>%
  arrange(desc(Count)) -> Outliers
Outliers %>% kable()
carrier | Count | Ave_Delay | Max_Delay |
---|---|---|---|
EV | 1538 | 204.5039 | 548 |
B6 | 1010 | 205.2574 | 502 |
UA | 932 | 212.9979 | 483 |
DL | 750 | 243.4667 | 960 |
9E | 521 | 207.7274 | 747 |
AA | 466 | 222.0064 | 1014 |
MQ | 378 | 215.2090 | 1137 |
WN | 331 | 219.1722 | 471 |
US | 158 | 215.8734 | 500 |
VX | 132 | 233.2273 | 653 |
By construction, 5 percent of the positive delays (roughly 2 percent of all observations in the dataset) are outliers by this definition.
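That share can be checked in one line, using the delays vector and UL cutoff defined above:

# fraction of recorded delays exceeding the upper cutoff
mean(delays > UL, na.rm = TRUE)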
Finally, we can do k-means clustering on the delay times.
My fellow R programmers will note that with dplyr and magrittr constructs, the less frequently used ‘forward assignment’ operator -> is a very handy tool. This is particularly true when you begin with the data and don’t realize until the end of the sentence that you wish to save the artifact for use later!
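In miniature (the object name ua_summary is just for illustration):

# the sentence reads left to right, and the artifact is captured at the very end
flights %>%
  filter(carrier == "UA") %>%
  summarize(flights_flown = n()) -> ua_summary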
flights %>%
  filter(dep_delay > 0) %>%
  filter(!is.na(arr_delay)) %>%
  select(dep_delay, arr_delay) %>%
  rename(arrival = arr_delay, departure = dep_delay) -> kdat
kdat %>% kmeans(6) -> kc
kdat$cluster = factor(kc$cluster)
centers = as.data.frame(kc$centers)
ggplot(kdat, aes(x = departure, y = arrival, color = cluster)) + geom_point()
Admittedly, this is not a great example of k-means; this dataset has the interesting feature that its columns are either highly correlated or not correlated at all. To show the usefulness of k-means, we will demonstrate against the iris dataset:
data(iris)
iris %>%
  rename(length = Petal.Length, width = Petal.Width) %>%
  select(length, width) -> ID
ID %>% kmeans(3) -> cl
ID$cluster = factor(cl$cluster)
centers = as.data.frame(cl$centers)
ID %>%
  ggplot(aes(x = length, y = width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center')) +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center'),
             size = 52, alpha = .3) +
  theme(legend.position = "none")
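One hedged aside, not part of the walkthrough above: kmeans is sensitive to the scale of its columns and starts from a random configuration, so in practice it is common to standardize the features and fix the seed. A minimal sketch:

set.seed(42)                         # kmeans starts randomly; fix the seed for reproducibility
iris %>%
  select(Petal.Length, Petal.Width) %>%
  scale() %>%                        # put both columns on a common scale
  kmeans(3) -> cl_scaled
table(cl_scaled$cluster, iris$Species)   # compare clusters to the known species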
The work shown here scales to a ‘production’ environment quite nicely. The dplyr commands used to subset and extract data are agnostic to the size of the dataset; the same commands that work against 10 rows work against 10 million rows (although the latter may require tools like Hadoop or Spark, these are invisible to the user). In other words, we point the scripts at a different file, and the analysis executes the same way.
You may notice that I use knitr::kable() to render tables in the markdown environment. This is not strictly necessary; each ‘dangling’ command at the end of a code block, i.e. one that does not explicitly render or save an artifact, gets an implicit call to base::print(). I just think that kable looks better in .html. (I also think that relying on implicit function calls is sloppy, but that’s another story.)
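To make the implicit-print point concrete, these two lines render the same output inside a knitr chunk:

head(flights, 3)          # relies on the implicit print
print(head(flights, 3))   # makes the call explicit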
I also use ggplot2::ggplot for ‘nice’ graphics, but revert to stats::qqnorm for quantile plots. The reason is simple: the base function works, and it is simple and fast. Additionally, I consider QQ plots to be ‘analyst’ graphs; very few clients would ever be interested in seeing one.