Here is a short document to show how easy it is to shred data using tools other than Excel. For this example, we will use the nycflights13 dataset. The first thing to do is to bring in the tools and the data.

library(dplyr); library(magrittr); library(ggplot2); library(knitr)
flights <- nycflights13::flights

Of these libraries, the one that is most important for this exercise is dplyr, which provides consistent, stable ‘verbs’ for data manipulation. Below we will use filter, group_by, summarize, rename, and select. The main advantage of dplyr is that it provides backends for a number of different data stores. In short - once I have the data imported, it behaves the same regardless of the storage construct. This is no small thing!

Part-and-parcel with dplyr is magrittr, which introduces the pipe operator %>%. This allows us to chain operations together and make ‘data sentences’. The general workflow is:

data %>% filter1 %>% filter...n %>% group_by %>% summarize %>% output

Remember: x %>% F() is equivalent to F(x), and x %>% F() %>% G() %>% H() is equivalent to H(G(F(x))).
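
A toy illustration (not part of the flights analysis): the two lines below compute the same value, but the piped version reads left to right in the order the operations happen.

# base nesting vs. the magrittr pipe -- both return the same number
sum(sqrt(1:10))
1:10 %>% sqrt() %>% sum()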

This is a ‘mid-size’ dataset, with 336776 rows and 19 columns. We can take a ‘peek’ at the data here:

flights %>% head() %>% select(1:5, 10) %>% kable()
| year | month | day | dep_time | sched_dep_time | carrier |
|------|-------|-----|----------|----------------|---------|
| 2013 | 1     | 1   | 517      | 515            | UA      |
| 2013 | 1     | 1   | 533      | 529            | UA      |
| 2013 | 1     | 1   | 542      | 540            | AA      |
| 2013 | 1     | 1   | 544      | 545            | B6      |
| 2013 | 1     | 1   | 554      | 600            | DL      |
| 2013 | 1     | 1   | 554      | 558            | UA      |

Subsetting

A major task for data science is subsetting; i.e. finding the ‘one’ observation that you are looking for in a ‘haystack’ of data. Clearly, going through all of this by hand is not the right approach.

dplyr makes this easy by applying logical filters in a responsible, memory-efficient way. This is a technical detail that most people don’t really care about; suffice it to say that MS Excel handles memory in a very expensive, sloppy manner.

Suppose that I want to find all United Airlines flights departing Newark between 6:30 and 8:30 AM on the first through third of January that are at least 5 hours long, with less than a one-minute delay. Only show the columns for departure time, flight number, tail number, and destination. Finally, rename the dep_time column to ‘Time’.

flights %>%
  filter(month == 1, day > 0, day < 4, origin == "EWR", carrier == "UA",
         dep_time < 830, dep_time > 630, dep_delay < 1, air_time > 5*60) %>%
  select(dep_time, flight, tailnum, dest) %>%
  rename(Time = dep_time) %>%
  kable()
| Time | flight | tailnum | dest |
|------|--------|---------|------|
| 746  | 1668   | N24224  | SFO  |
| 745  | 1668   | N37287  | SFO  |
| 804  | 423    | N528UA  | PDX  |
| 822  | 1067   | N73259  | LAX  |
| 701  | 1455   | N13750  | SNA  |
| 704  | 604    | N573UA  | LAX  |

But wait, there’s more: suppose we want to find which aircraft are the most frequently used? Easy enough:

flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarize(count = n()) %>%
  top_n(wt = count, n = 10) %>%
  arrange(desc(count)) %>%
  kable()
| tailnum | count |
|---------|-------|
| N725MQ  | 575   |
| N722MQ  | 513   |
| N723MQ  | 507   |
| N711MQ  | 486   |
| N713MQ  | 483   |
| N258JB  | 427   |
| N298JB  | 407   |
| N353JB  | 404   |
| N351JB  | 402   |
| N735MQ  | 396   |
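
For what it’s worth, the same tally can be written a bit more compactly with dplyr’s count() helper (an equivalent formulation; the result column is named n instead of count):

# count() combines group_by() + summarize(n()) in one step
flights %>% filter(!is.na(tailnum)) %>% count(tailnum, sort = TRUE) %>% head(10) %>% kable()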

Anomaly Detection

Suppose I want to find the anomalies in departure delays. In the medical field, you might want to look for outliers in billing or whatever.

Step 1 - plot the data. If you are using the empirical rule for outliers, i.e. \(\pm 2 \sigma\), you may have problems. That rule rests on a heavy assumption of normality (a bell-shaped curve), and not all data is ‘normal’. While we have both positive and negative delays (i.e. times when the plane departed early), we really only care about the positive ones. Take a look at the graph of delays:

flights %>% filter(dep_delay > 0) %>% filter(dep_delay < 200) %>% ggplot(aes(x = dep_delay)) + geom_histogram() + xlab("Delay")
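
To see why the \(\pm 2 \sigma\) rule is shaky here, this quick sketch computes the bounds that rule would give using the sample mean and standard deviation of the positive delays; compare them with the empirical quantiles computed below.

# naive 'empirical rule' bounds on the positive delays (for comparison only)
pos_delays <- flights %>% filter(dep_delay > 0) %>% pull(dep_delay)
mean(pos_delays) + c(-2, 2) * sd(pos_delays)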

Statisticians are more comfortable with using the cumulative distribution (for reasons I can explain later). That looks like this:

flights %>% filter(dep_delay > 0) %>% ggplot(aes(dep_delay)) + stat_ecdf(geom = "point") + xlab("Delay") + ylab("Pr{X <= x}")

This is clearly not the Normal distribution. To demonstrate this, we can compare the quantiles of the delay data to those of the theoretical normal distribution using qqnorm:

flights %>% filter(dep_delay > 0) %$% qqnorm(dep_delay)

Note the use of the exposition operator %$% to address the column dep_delay directly in the graph command. Compare this with the qq plot of randomly generated normal data:

rnorm(5000) %>% qqnorm()

Instead of doing the normal (pun intended) thing, \(\mu - 1.96 \sigma < X < \mu + 1.96 \sigma\), we can determine the 95% critical values directly from the empirical cdf:

delays = flights$dep_delay
LL = delays[delays >0] %>% quantile(.025, na.rm = TRUE)
UL = delays[delays >0] %>% quantile(.975, na.rm = TRUE)

Using this empirical, 2-sided construct, we would find delays of less than 1 minute and greater than 193 minutes to be outliers.
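
The same two cutoffs can also be pulled in a single call, which is handy if you just want to eyeball them:

delays[delays > 0] %>% quantile(c(.025, .975), na.rm = TRUE)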

What does it mean to be an outlier?

Generally, we don’t care if the amount of delay time is less than average. Therefore, the amount of variability we care about is at the upper part of the distribution (Upper Tail). So, let’s find the outliers that way:

UL = delays[delays>0] %>% quantile(.95, na.rm = TRUE)
flights %>%
  filter(dep_delay > UL) %>%
  group_by(carrier) %>%
  summarize(Count = n(),
            Ave_Delay = mean(dep_delay, na.rm = TRUE),
            Max_Delay = max(dep_delay, na.rm = TRUE)) %>%
  top_n(wt = Count, n = 10) %>%
  arrange(desc(Count)) -> Outliers
Outliers %>% kable()
| carrier | Count | Ave_Delay | Max_Delay |
|---------|-------|-----------|-----------|
| EV      | 1538  | 204.5039  | 548       |
| B6      | 1010  | 205.2574  | 502       |
| UA      | 932   | 212.9979  | 483       |
| DL      | 750   | 243.4667  | 960       |
| 9E      | 521   | 207.7274  | 747       |
| AA      | 466   | 222.0064  | 1014      |
| MQ      | 378   | 215.2090  | 1137      |
| WN      | 331   | 219.1722  | 471       |
| US      | 158   | 215.8734  | 500       |
| VX      | 132   | 233.2273  | 653       |

In this dataset, 1 percent of the observations are outliers by this definition.
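
If you want to compute that proportion yourself, one direct way (using the delays vector and UL cutoff defined above) is:

# share of all (non-missing) departures whose delay exceeds the 95th-percentile cutoff
mean(delays > UL, na.rm = TRUE)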

K-means

Finally, we can do k-means clustering on the delay times.

My fellow R programmers will note that with dplyr and magrittr constructs, the less frequently used ‘forward assignment’ operator -> is a very handy tool. This is particularly true when you begin with the data and don’t realize until the end of the sentence that you wish to save the artifact for later use!

flights %>%
  filter(dep_delay > 0) %>%
  filter(!is.na(arr_delay)) %>%
  select(dep_delay, arr_delay) %>%
  rename(arrival = arr_delay, departure = dep_delay) -> kdat

kdat %>% kmeans(6) -> kc
kdat$cluster = factor(kc$cluster)
centers = as.data.frame(kc$centers)

ggplot(kdat, aes(x = departure, y = arrival, color = cluster)) + geom_point()
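
One caveat worth noting: kmeans() starts from randomly chosen centers, so the cluster labels (and occasionally the partition itself) can change between runs. If you want the plot above to be reproducible, set a seed before calling kmeans(); the seed value here is arbitrary:

set.seed(42)                                    # any fixed value will do
kdat %>% select(departure, arrival) %>% kmeans(6) -> kc   # same call as above, now reproducible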

Admittedly, this is not a great example of k-means. In this dataset the two delay columns are very highly correlated, so the ‘clusters’ mostly just carve up points along a line; to show the usefulness of k-means, we will demonstrate against the iris dataset:

data(iris)

iris %>% rename(length = Petal.Length, width = Petal.Width) %>% select(length, width) -> ID

ID %>% kmeans(3) -> cl

ID$cluster = factor(cl$cluster)
centers = as.data.frame(cl$centers)

ID %>% ggplot(aes(x = length, y = width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center')) +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center'), size = 52, alpha = .3) +
  theme(legend.position = "none")
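
As a quick sanity check (not part of the original write-up), we can cross-tabulate the clusters against the actual species labels to see how well the unsupervised grouping recovers them:

table(cluster = cl$cluster, species = iris$Species)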

Scalability

The work shown here scales to a ‘production’ environment quite nicely. The dplyr commands used to subset and extract data are agnostic to the size of the dataset; the same commands that work against 10 rows work against 10 million rows (although at that scale they may run through tools like Hadoop or Spark, these backends are largely invisible to the user). In other words, we point the scripts at a different data source, and the analysis executes the same way.
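
As a rough sketch of what ‘pointing the same verbs at a different backend’ can look like, here is the flights table copied into an in-memory SQLite database and queried with the very same dplyr grammar. (This assumes the DBI, RSQLite, and dbplyr packages are installed; the connection details are purely illustrative - a production system would point at a real database or a Spark cluster.)

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # stand-in for a real database
copy_to(con, nycflights13::flights, "flights")
tbl(con, "flights") %>%
  filter(origin == "EWR", carrier == "UA") %>%
  group_by(dest) %>%
  summarize(count = n()) %>%
  collect()                                        # pull the result back into R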

Notes on Markdown

You may notice that I use knitr::kable() to render tables in the markdown environment. This is not strictly necessary; each ‘dangling’ command at the end of a code block - i.e. one that does not explicitly render or save an artifact - has an implicit call to base::print(). I just think that kable looks better in .html. (I also think that relying on implicit function calls is sloppy, but that’s another story)
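
As a small illustration (a hypothetical chunk, not from the original), both of these lines produce output in the rendered document, but only the second makes the rendering explicit:

flights %>% head(3) %>% select(1:4)             # relies on the implicit print()
flights %>% head(3) %>% select(1:4) %>% kable() # explicit markdown table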

I also use ggplot2::ggplot for ‘nice’ graphics, but revert to stats::qqnorm for quantile plots. The reason is simple: the base-R function works, is simple, and is fast. Additionally, I consider QQ plots to be ‘analyst’ graphs; very few clients would ever be interested in seeing one.