Here is a short document to show how easy it is to shred data using tools other than Excel. For this example, we will use the nycflights13 dataset. The first thing to do is to bring in the tools and the data.

library(dplyr); library(magrittr); library(ggplot2); library(knitr)
flights <- nycflights13::flights

Of these libraries, the one that is most important for this exercise is dplyr, which provides consistent, stable ‘verbs’ for data manipulation. Below we will use filter, group_by, summarize, rename, and select. The main advantage of dplyr is that it provides backends for a number of different data stores. In short - once I have the data imported, it behaves the same regardless of the storage construct. This is no small thing!

Part-and-parcel with dplyr is magrittr, which introduces the pipe operator %>%. This allows us to chain operations together and make ‘data sentences’. The general workflow is:

data %>% filter1 %>% filter...n %>% group_by %>% summarize %>% output

Remember: x %>% F() is equivalent to F(x), and x %>% F() %>% G() %>% H() is equivalent to H(G(F(x))).
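
A toy illustration (not part of the flights analysis): the two lines below compute the same value, but the piped version reads left to right in the order the operations happen.

# base nesting vs. the magrittr pipe -- both return the same number
sum(sqrt(1:10))
1:10 %>% sqrt() %>% sum()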

This is a ‘mid-size’ dataset, with 336776 rows and 19 columns. We can take a ‘peek’ at the data here:

flights %>% head() %>% select(1:5, 10) %>% kable()
| year | month | day | dep_time | sched_dep_time | carrier |
|------|-------|-----|----------|----------------|---------|
| 2013 | 1     | 1   | 517      | 515            | UA      |
| 2013 | 1     | 1   | 533      | 529            | UA      |
| 2013 | 1     | 1   | 542      | 540            | AA      |
| 2013 | 1     | 1   | 544      | 545            | B6      |
| 2013 | 1     | 1   | 554      | 600            | DL      |
| 2013 | 1     | 1   | 554      | 558            | UA      |

Subsetting

A major task for data science is subsetting; i.e. finding the ‘one’ observation that you are looking for in a ‘haystack’ of data. Clearly, going through all of this by hand is not the right approach.

dplyr makes this easy by applying logical filters in a responsible, memory-efficient way. This is a technical detail that most people don’t really care about; suffice it to say that MS Excel handles memory in a very expensive, sloppy manner.

Suppose that I want to find all United Airlines flights departing Newark between 6:30 and 8:30 AM on the first through third of January that are at least 5 hours long, with less than a one-minute delay. Only show the columns for departure time, flight number, tail number, and destination. Finally, rename the dep_time column to ‘Time’.

flights %>%
  filter(month == 1, day > 0, day < 4, origin == "EWR", carrier == "UA",
         dep_time < 830, dep_time > 630, dep_delay < 1, air_time > 5*60) %>%
  select(dep_time, flight, tailnum, dest) %>%
  rename(Time = dep_time) %>%
  kable()
| Time | flight | tailnum | dest |
|------|--------|---------|------|
| 746  | 1668   | N24224  | SFO  |
| 745  | 1668   | N37287  | SFO  |
| 804  | 423    | N528UA  | PDX  |
| 822  | 1067   | N73259  | LAX  |
| 701  | 1455   | N13750  | SNA  |
| 704  | 604    | N573UA  | LAX  |

But wait, there’s more: suppose we want to find which aircraft are the most frequently used? Easy enough:

flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarize(count = n()) %>%
  top_n(wt = count, n = 10) %>%
  arrange(desc(count)) %>%
  kable()
| tailnum | count |
|---------|-------|
| N725MQ  | 575   |
| N722MQ  | 513   |
| N723MQ  | 507   |
| N711MQ  | 486   |
| N713MQ  | 483   |
| N258JB  | 427   |
| N298JB  | 407   |
| N353JB  | 404   |
| N351JB  | 402   |
| N735MQ  | 396   |
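
For what it’s worth, the same tally can be written a bit more compactly with dplyr’s count() helper (an equivalent formulation; the result column is named n instead of count):

# count() combines group_by() + summarize(n()) in one step
flights %>% filter(!is.na(tailnum)) %>% count(tailnum, sort = TRUE) %>% head(10) %>% kable()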

Anomaly Detection

Suppose I want to find the anomalies in departure delays. In the medical field, you might want to look for outliers in billing or whatever.

Step 1 - plot the data. If you are using the empirical rule for outliers, i.e. \(\pm 2 \sigma\), you may have problems. That rule rests on a heavy assumption of normality (a bell-shaped curve), and not all data is ‘normal’. While we have both positive and negative delays (i.e. times when the plane departed early), we really only care about the positive ones. Take a look at the graph of delays:

flights %>% filter(dep_delay > 0) %>% filter(dep_delay < 200) %>% ggplot(aes(x = dep_delay)) + geom_histogram() + xlab("Delay")
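
To see why the \(\pm 2 \sigma\) rule is shaky here, this quick sketch computes the bounds that rule would give using the sample mean and standard deviation of the positive delays; compare them with the empirical quantiles computed below.

# naive 'empirical rule' bounds on the positive delays (for comparison only)
pos_delays <- flights %>% filter(dep_delay > 0) %>% pull(dep_delay)
mean(pos_delays) + c(-2, 2) * sd(pos_delays)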

Statisticians are more comfortable with using the cumulative distribution (for reasons I can explain later). That looks like this:

flights %>% filter(dep_delay > 0) %>% ggplot(aes(dep_delay)) + stat_ecdf(geom = "point") + xlab("Delay") + ylab("Pr{X <= x}")

This is clearly not the Normal distribution. To demonstrate this, we can compare the quantiles of the delay data to those of the theoretical normal distribution using qqnorm:

flights %>% filter(dep_delay > 0) %$% qqnorm(dep_delay)

Note the use of the exposition operator %$% to address the column dep_delay directly in the graph command. Compare this with the qq plot of randomly generated normal data:

rnorm(5000) %>% qqnorm()

Instead of doing the normal (pun intended) thing, \(\mu - 1.96 \sigma < X < \mu + 1.96 \sigma\), we can determine the 95% critical values directly from the empirical cdf:

delays = flights$dep_delay
LL = delays[delays >0] %>% quantile(.025, na.rm = TRUE)
UL = delays[delays >0] %>% quantile(.975, na.rm = TRUE)

Using this empirical, 2-sided construct, we would find delays of less than 1 minute and greater than 193 minutes to be outliers.
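
The same two cutoffs can also be pulled in a single call, which is handy if you just want to eyeball them:

delays[delays > 0] %>% quantile(c(.025, .975), na.rm = TRUE)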

What does it mean to be an outlier?

Generally, we don’t care if the amount of delay time is less than average. Therefore, the amount of variability we care about is at the upper part of the distribution (Upper Tail). So, let’s find the outliers that way:

UL = delays[delays>0] %>% quantile(.95, na.rm = TRUE)
flights %>%
  filter(dep_delay > UL) %>%
  group_by(carrier) %>%
  summarize(Count = n(),
            Ave_Delay = mean(dep_delay, na.rm = TRUE),
            Max_Delay = max(dep_delay, na.rm = TRUE)) %>%
  top_n(wt = Count, n = 10) %>%
  arrange(desc(Count)) -> Outliers
Outliers %>% kable()
| carrier | Count | Ave_Delay | Max_Delay |
|---------|-------|-----------|-----------|
| EV      | 1538  | 204.5039  | 548       |
| B6      | 1010  | 205.2574  | 502       |
| UA      | 932   | 212.9979  | 483       |
| DL      | 750   | 243.4667  | 960       |
| 9E      | 521   | 207.7274  | 747       |
| AA      | 466   | 222.0064  | 1014      |
| MQ      | 378   | 215.2090  | 1137      |
| WN      | 331   | 219.1722  | 471       |
| US      | 158   | 215.8734  | 500       |
| VX      | 132   | 233.2273  | 653       |

In this dataset, 1 percent of the observations are outliers by this definition.
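
If you want to compute that proportion yourself, one direct way (using the delays vector and UL cutoff defined above) is:

# share of all (non-missing) departures whose delay exceeds the 95th-percentile cutoff
mean(delays > UL, na.rm = TRUE)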

K-means

Finally, we can do k-means clustering on the delay times.

My fellow R programmers will note that with dplyr and magrittr constructs, the less frequently used ‘forward assignment’ operator -> is a very handy tool. This is particularly true when you begin with the data and don’t realize until the end of the sentence that you wish to save the artifact for later use!

flights %>%
  filter(dep_delay > 0) %>%
  filter(!is.na(arr_delay)) %>%
  select(dep_delay, arr_delay) %>%
  rename(arrival = arr_delay, departure = dep_delay) -> kdat

kdat %>% kmeans(6) -> kc
kdat$cluster = factor(kc$cluster)
centers = as.data.frame(kc$centers)

ggplot(kdat, aes(x = departure, y = arrival, color = cluster)) + geom_point()
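
One caveat worth noting: kmeans() starts from randomly chosen centers, so the cluster labels (and occasionally the partition itself) can change between runs. If you want the plot above to be reproducible, set a seed before calling kmeans(); the seed value here is arbitrary:

set.seed(42)                                    # any fixed value will do
kdat %>% select(departure, arrival) %>% kmeans(6) -> kc   # same call as above, now reproducible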

Admittedly, this is not a great example of k-means. In this dataset the two delay columns are very highly correlated, so the ‘clusters’ mostly just carve up points along a line; to show the usefulness of k-means, we will demonstrate against the iris dataset:

data(iris)

iris %>% rename(length = Petal.Length, width = Petal.Width) %>% select(length, width) -> ID

ID %>% kmeans(3) -> cl

ID$cluster = factor(cl$cluster)
centers = as.data.frame(cl$centers)

ID %>% ggplot(aes(x = length, y = width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center')) +
  geom_point(data = centers, aes(x = length, y = width, color = 'Center'), size = 52, alpha = .3) +
  theme(legend.position = "none")
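
As a quick sanity check (not part of the original write-up), we can cross-tabulate the clusters against the actual species labels to see how well the unsupervised grouping recovers them:

table(cluster = cl$cluster, species = iris$Species)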

Scalability

The work shown here scales to a ‘production’ environment quite nicely. The dplyr commands used to subset and extract data are agnostic to the size of the dataset; the same commands that work against 10 rows work against 10 million rows (although at that scale they may run through tools like Hadoop or Spark, these backends are largely invisible to the user). In other words, we point the scripts at a different data source, and the analysis executes the same way.
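
As a rough sketch of what ‘pointing the same verbs at a different backend’ can look like, here is the flights table copied into an in-memory SQLite database and queried with the very same dplyr grammar. (This assumes the DBI, RSQLite, and dbplyr packages are installed; the connection details are purely illustrative - a production system would point at a real database or a Spark cluster.)

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # stand-in for a real database
copy_to(con, nycflights13::flights, "flights")
tbl(con, "flights") %>%
  filter(origin == "EWR", carrier == "UA") %>%
  group_by(dest) %>%
  summarize(count = n()) %>%
  collect()                                        # pull the result back into R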

Notes on Markdown

You may notice that I use knitr::kable() to render tables in the markdown environment. This is not strictly necessary; each ‘dangling’ command at the end of a code block - i.e. one that does not explicitly render or save an artifact - has an implicit call to base::print(). I just think that kable looks better in .html. (I also think that relying on implicit function calls is sloppy, but that’s another story)
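
As a small illustration (a hypothetical chunk, not from the original), both of these lines produce output in the rendered document, but only the second makes the rendering explicit:

flights %>% head(3) %>% select(1:4)             # relies on the implicit print()
flights %>% head(3) %>% select(1:4) %>% kable() # explicit markdown table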

I also use ggplot2::ggplot for ‘nice’ graphics, but revert to stats::qqnorm for quantile plots. The reason is simple: the base-R function works, is simple, and is fast. Additionally, I consider QQ plots to be ‘analyst’ graphs; very few clients would ever be interested in seeing one.