Big-ish Data Workflow in R

Introduction
The Data Workflow
- Setting up plotly in R
- Importing the CSV in SQLite
On-Disk Analysis in R with dplyr
- About dplyr
Part 2 Time Series Operations
- Aggregate Time Series
Contact me

Introduction

This is an R version of a recent article Big data analytics with Pandas and SQLite by plotly. I’d argue this is ‘medium’ data, however, I understand the need to generate a nice headline. For the R folks, I also expand on alternative packages and methods that may improve the workflow.

This document is not meant to make any claims that R is better or faster than Python for data analysis, I myself use both languages daily. The plotly article simply provides an opportunity to compare the two languages.

The data from this article is updated daily, therefore my version of the file is quite a bit larger at 4.63 GB. Therefore, expect the charts to return different results compared to the article.

Here are the details quoted from the article…

This notebook explores a 3.9GB CSV file containing NYC’s 311 complaints since 2003. It’s the most popular data set in NYC’s open data portal.

This notebook is a primer on out-of-memory data analysis with

pandas : A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook : An interface for writing and sharing python code, text, and plots.
SQLite : An self-contained, server-less database that’s easy to set-up and query from Pandas.
plotly : A platform for publishing beautiful, interactive graphs from Python to the web.

The dataset is too large to load into a Pandas dataframe. So, instead we’ll perform out-of-memory aggregations with SQLite and load the result directly into a dataframe with Panda’s iotools . It’s pretty easy to stream a CSV into SQLite and SQLite requires no setup. The SQL query language is pretty intuitive coming from a Pandas mindset.

Since we are using R, we will be substituting the python packages with the closest R counterpart, this is of course subjective.

Instead of pandas, we will be using dplyr
Instead of IPython notebook I have published this document using knitr / Rmarkdown via RStudio
I will maintain the use of SQLite, and plotly for R & ggplot2. However, for time series charts, I highly recommend the R package dygraphs, which provides similar interactive plotting, without requiring an internet connection.

Where possible I will outline to opporunities to substitute different R packages that in my opinion may improve the workflow.

The Data Workflow

Setting up plotly in R

plotly for R is not yet available on CRAN. We can instead download it from the rOpenSci github repo.

install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")

library(plotly)

You’ll need to create an account to connect to the plotly API. Or you could just stick with the default ggplot2 graphics.

set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key

Importing the CSV in SQLite

This defeats the purpose of this article, but R can load this data into memory easily. If your computer resources permit it, the data will be much faster to operate on in-memory compared to an SQL database on disk. 8GB of RAM would be plenty in this case. In the spirit of a true comparison, I will replicate the on-disk analysis approach using an SQLite database, however, I will show a few benchmark’s using the data in memory using data.table along the way.

On-Disk Analysis in R with dplyr

While dplyr is capable of writing to databases, the data still must flow through R, which would probably be considered cheating in this case. The alternative is to simply to create a database and import the csv using the command line. Please create a pull request for this github repository if you have any other suggestions.

Here is the code I typed into the command line. Pretty easy right? This assumes you have sqlite3 installed and available on your PATH variable (hence accessible via terminal).

$ sqlite3 data.db # Create your database
$.databases       # Show databases to make sure it works
$.mode csv        
$.import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
$.quit

Let’s also load the data into memory so we can compare in memory operations along the way. Here is a crude benchmark of file I/O in R. readr a recently released alternative to read.csv should also eat through this data quickly.

library(readr)
# data.table, selecting a subset of columns
time_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv', 
                   select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'), 
                   showProgress = T))
# Default data.table
time_data.table_full <- system.time(fread('/users/ryankelly/NYC_data.csv',
                                            showProgress = T))
# Default readr
time_readr <- system.time(read_csv('/users/ryankelly/NYC_data.csv'))

# Default base R (really slow - don't recommend running)
# time_base_r <- system.time(read.csv('/users/ryankelly/NYC_data.csv'))

kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))

	user.self	sys.self	elapsed
time_data.table	63.588	1.952	65.633
time_data.table_full	205.571	3.124	208.880
time_readr	277.720	5.018	283.029

I will be using data.table to read in the data. The fread function has the really nice ability to select the columns we want to read in, this speeds up the read quite a bit. The plotly article disregards all but the following columns from the dataset.

library(data.table)

dt <- fread('/users/ryankelly/NYC_data.csv', 
            select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'), 
            showProgress = F)

# Rename columns to remove spaces
setnames(dt, 'Created Date', 'CreatedDate')
setnames(dt, 'Closed Date', 'ClosedDate')
setnames(dt, 'Complaint Type', 'ComplaintType')

About dplyr

Switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), so dplyr allows you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function: that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst. - dplyr databases

dplyr is pretty fantastic because you can use the same syntax to:

Write dataframes to databases
Read from databases into memory
Operate on database via SQL wrappers
Operate on memory

In the event you need to run a more sophisticated query or if you prefer plain SQL, you can also simply pass in an SQL statement as a string, or generate your own sql wrappers. dplyr supports: sqlite, mysql, postgresql, and google’s bigquery. You will see that dplyr’s syntax is very similar to SQL.

For a more information about dplyr, see these two tutorials, which I borrow from heavily:

By default, dplyr queries will only pull the first 10 rows from the database, and never pulls data into R unless you ask for it directly.

library(dplyr)      ## Will be used for pandas replacement

# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db

## src:  sqlite 3.8.6 [/users/ryankelly/data.db]
## tbls: NYC_data

# Connect to the table of interest, which I called NYC_data
data <- tbl(db, 'NYC_data')
# We can pass SQL directly here to select only the columns of interest
# I 'rename' the columns with spaces to be able to properly access them in dplyr
data <- tbl(db, sql('SELECT Agency, "Created Date" AS CreatedDate, 
                    "Closed Date" AS ClosedDate, "Complaint Type" AS ComplaintType,  Descriptor, City
                    FROM NYC_data'))

Once we have to do a more expensive query, the data become pretty slow to work with. This would be apparent in both Python or R. This is why you would rather operate on the data in-memory. The two best choices for data manipulation (besides base R) are:

data.table
dplyr

Remember, the following analysis is completed using dplyr as a wrapper for SQL statements, operating on-disk. Use explain() to expose the SQL commands in your statements. You can simply use raw SQL strings if you prefer. I will also show how to run the same commands in data.table in the comments.

I am not going to benchmark every query, as there are many examples of the speed advantages of data.table. I prefer data.table for both the speed boost, and compact syntax. For a first look at data.table, take a look at this cheatsheet.

Preview the data

I make use of kable() throughout this document as a way to print pretty tables.

# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))

head_(data)

Agency	CreatedDate	ComplaintType	Descriptor	City
NYPD	04/11/2015 02:13:04 AM	Noise - Street/Sidewalk	Loud Music/Party	BROOKLYN
DFTA	04/11/2015 02:12:05 AM	Senior Center Complaint	N/A	ELMHURST
NYPD	04/11/2015 02:11:46 AM	Noise - Commercial	Loud Music/Party	JAMAICA
NYPD	04/11/2015 02:11:02 AM	Noise - Street/Sidewalk	Loud Talking	BROOKLYN
NYPD	04/11/2015 02:10:45 AM	Noise - Street/Sidewalk	Loud Music/Party	NEW YORK

Select just a couple of columns

# dt[, .(ComplaintType, Descriptor, Agency)]

q <- data %>% select(ComplaintType, Descriptor, Agency)
head_(q)

ComplaintType	Descriptor	Agency
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Senior Center Complaint	N/A	DFTA
Noise - Commercial	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Noise - Street/Sidewalk	Loud Music/Party	NYPD

Limit the number of items retrieved

# dt[, .(ComplaintType, Descriptor, Agency)][1:10]

q <- data %>% select(ComplaintType, Descriptor, Agency)
head_(q, n = 10)

ComplaintType	Descriptor	Agency
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Senior Center Complaint	N/A	DFTA
Noise - Commercial	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Noise - Commercial	Loud Music/Party	NYPD
HPD Literature Request	The ABCs of Housing - Spanish	HPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Street Condition	Plate Condition - Noisy	DOT

Filter rows with WHERE

# dt[Agency == 'NYPD', .(ComplaintType, Descriptor, Agency)]

q <- data %>% 
          select(ComplaintType, Descriptor, Agency) %>% 
          filter(Agency == "NYPD")
head_(q)

ComplaintType	Descriptor	Agency
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Noise - Commercial	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD

Filter multiple values in a column with WHERE and IN

# dt[Agency == 'NYPD' | Agency == 'DOB', .(ComplaintType, Descriptor, Agency)]

q <- data %>% select(ComplaintType, Descriptor, Agency) %>% 
               filter(Agency %in% c('DOB', 'NYPD'))
head_(q)

ComplaintType	Descriptor	Agency
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Noise - Commercial	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD
Noise - Street/Sidewalk	Loud Music/Party	NYPD
Noise - Street/Sidewalk	Loud Talking	NYPD

Find the unique values in a column with DISTINCT

# dt[, unique(City)]

q <- data %>% select(City) %>% distinct()
head(q)

##       City
## 1 BROOKLYN
## 2 ELMHURST
## 3  JAMAICA
## 4 NEW YORK
## 5         
## 6  BAYSIDE

Query value counts with COUNT(*) and GROUP BY

# dt[, .(No.Complaints = .N), Agency]
#setkey(dt, No.Complaints) # setkey index's the data

q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)

Agency	No.Complaints
3-1-1	22499
ACS	3
AJC	7
ART	3
CAU	8

Order the results with ORDER and -

# dt[, .(No.Complaints = .N), Agency]
#setkey(dt, No.Complaints) # setkey index's the data

q <- data %>% select(Agency) %>% 
              group_by(Agency) %>% 
              summarise(No.Complaints = n()) %>%
              arrange(-No.Complaints)

# Pull the data out of memory to plot it
q <- collect(q)

# Convert to ordered factor to maintain order in plot
q$Agency <- factor(q$Agency, levels = q$Agency, ordered = T)

library(ggplot2)
# Plot top 50
plt <- ggplot(q[1:50,], aes(Agency, No.Complaints)) + 
        geom_bar(stat= 'identity') +
        theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Heat / Hot Water is the most common complaint

# dt[, .(No.Complaints = .N), ComplaintType]
#setkey(dt, No.Complaints) # setkey index's the data

q <- data %>% select(ComplaintType, Agency) %>% 
              group_by(ComplaintType) %>% 
              summarise(No.Complaints = n()) %>%
              arrange(-No.Complaints)

# Pull the data out of memory to plot it
q <- collect(q)

# Convert to ordered factor to maintain order in plot
q$ComplaintType <- factor(q$ComplaintType, levels = q$ComplaintType, ordered = T)

# Plot the data (top 50)
plt <- ggplot(q[1:50,], aes(x = ComplaintType, y = No.Complaints)) + 
            geom_bar(stat= 'identity') +
            theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

What is the most common complaint in each city?

How many cities are in the database?

# dt[, unique(City)]

q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)

##   Number.of.Cities
## 1             1818

Yikes - let’s just plot the 10 most complained about cities

# dt[, (No.Complaints = .N), City]
#setkey(dt, No.Complaints)

q <- data %>% select(City) %>% group_by(City) %>% 
        summarise(No.Complaints = n()) %>%
        arrange(-No.Complaints)

head_(q, 10)

City	No.Complaints
BROOKLYN	2671085
NEW YORK	1692514
BRONX	1624292
	766378
STATEN ISLAND	437395
JAMAICA	147133
FLUSHING	117669
ASTORIA	90570
Jamaica	67083
RIDGEWOOD	66411

Flushing and FLUSHING, Jamaica and JAMAICA… the complaints are case sensitive.

Perform case insensitive queries with GROUP BY with COLLATE NOCASE

I decided to use UPPER to convert the CITY format.

# dt[, CITY := toupper(City)][, (No.Complaints = .N), CITY]
#setkey(dt, No.Complaints)

# No clear way to do this using dplyr functions - default to regular SQL
q <- tbl(db, sql('SELECT UPPER(City) as "CITY", COUNT(*) as "No.Complaints"
                    FROM NYC_data 
                    GROUP BY "CITY" 
                    ORDER BY -"No.Complaints"
                    LIMIT 11'))

head_(q, 10)

CITY	No.Complaints
BROOKLYN	2671085
NEW YORK	1692514
BRONX	1624292
	766378
STATEN ISLAND	437395
JAMAICA	147133
FLUSHING	117669
ASTORIA	90570
JAMAICA	67083
RIDGEWOOD	66411

Complaint type count by city

The plotly article loops through each city and runs an sql command. Simply grouping by two columns City and Complaint type accomplishes the same task. I reduce the number of items plotted here for display purposes.

# dt[, CITY := toupper(City)][, (No.Complaints = .N), .(ComplaintType, CITY)]
#setkey(dt, No.Complaints)

# No clear way to do this using dplyr functions - default to regular SQL
q <- tbl(db, sql('SELECT "Complaint Type" as "ComplaintType", UPPER(City) as "CITY", 
                    COUNT(*) as "No.Complaints"
                    FROM NYC_data 
                    GROUP BY CITY, ComplaintType
                    ORDER BY -"No.Complaints"'))


# Select only list of cities used in ploty article
q_f <- filter(q, CITY %in% c(
                        'FAR ROCKAWAY',
                        'FLUSHING',
                        'JAMAICA',
                        'STATEN ISLAND',
                        'BRONX',
                        'NEW YORK',
                        'BROOKLYN'
))

# Pull the data out of memory to plot it
q_f <- collect(q_f)

# Select a cutoff for the most popular complaints to plot
# dt[, sum(No.Complaints), ComplaintType][, .(ComplaintType)]
top_complaints <- q_f%>% group_by(ComplaintType) %>% 
                        summarise(n = sum(No.Complaints)) %>%
                        arrange(-n) 
# Top 15 complaints
q_f <- filter(q_f, ComplaintType %in% top_complaints$ComplaintType[1:15])

# Plot result
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) + 
            geom_bar(stat = 'identity') + 
            theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

plt

# plotly cannot handle this type of plot currently.

Now let’s normalize these counts. This is super easy now that this data has been reduced into a dataframe.

# dt[, No.Complaints_normalized := No.Complaints / sum(No.Complaints)]
q_f <- q_f %>% group_by(CITY) %>% 
                mutate(Normalized.Complaints = round((No.Complaints / sum(No.Complaints))*100, 2))

# Plot result
plt <- ggplot(q_f, aes(ComplaintType, Normalized.Complaints, fill = CITY)) + 
    geom_bar(stat = 'identity', position = 'dodge') + 
    theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    ggtitle('Relative Number of Complaints by City')

plt

# plotly cannot handle this type of plot currently.

New York is loud.
Staten Island is quite dirty, dark and soggy
Bronx has great garbage collection.
Flushing’s muni meters are broken.

Part 2 Time Series Operations

The data provided does not fit the standard date format for SQLite. We have a few options to remedy this:

Create a new column in the SQL database, and reinsert the data with the formatted date statements
Create a new table and INSERT the formatted date into the original column name
Generate a SELECT wrapper using dplyr that formats the dates.

To illustrate using R, I will use dplyr. This is not necessarily the most optimal approach. Here I select substrings of the date fields, with SUBSTR and concatenate them with ||.

data <- tbl(db, sql('SELECT Agency, "Complaint Type" AS ComplaintType, Descriptor, City,
                    SUBSTR("Created Date", 7, 4) || "-" ||
                    SUBSTR("Created Date", 4, 2) || "-" ||
                    SUBSTR("Created Date", 1, 2) || " " ||
                    SUBSTR("Created Date", 12, 2) || ":" ||
                    SUBSTR("Created Date", 15, 2) || ":" ||
                    SUBSTR("Created Date", 18, 2) as CreatedDate, 
                    
                    SUBSTR("Closed Date", 7, 4) || "-" ||
                    SUBSTR("Closed Date", 4, 2) || "-" ||
                    SUBSTR("Closed Date", 1, 2) || " " ||
                    SUBSTR("Closed Date", 12, 2) || ":" ||
                    SUBSTR("Closed Date", 15, 2) || ":" ||
                    SUBSTR("Closed Date", 18, 2) as ClosedDate
                    
            FROM NYC_data'))

Filter SQLite rows with timestamp strings: YYYY-MM-DD hh:mm:ss

# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00', 
#      .(ComplaintType, CreatedDate, City)]

q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00",   CreatedDate > "2014-09-16 23:45:00") %>%
    select(ComplaintType, CreatedDate, City)

head_(q)

ComplaintType	CreatedDate	City
Noise - Street/Sidewalk	2014-11-12 11:59:56	BRONX
Taxi Complaint	2014-11-12 11:59:40	BROOKLYN
Noise - Commercial	2014-11-12 11:58:53	BROOKLYN
Noise - Commercial	2014-11-12 11:58:26	NEW YORK
Noise - Street/Sidewalk	2014-11-12 11:58:14	NEW YORK

Pull out the hour unit from timestamps with strftime

See ?strftime for allformat methods

# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]

q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>% 
            select(ComplaintType, CreatedDate, City, hour)

head_(q)

ComplaintType	CreatedDate	City	hour
Noise - Street/Sidewalk	2015-11-04 02:13:04	BROOKLYN	02
Senior Center Complaint	2015-11-04 02:12:05	ELMHURST	02
Noise - Commercial	2015-11-04 02:11:46	JAMAICA	02
Noise - Street/Sidewalk	2015-11-04 02:11:02	BROOKLYN	02
Noise - Street/Sidewalk	2015-11-04 02:10:45	NEW YORK	02

Count the number of complaints (rows) per hour with strftime , GROUP BY , and count(*) (n())

# dt[, hour := strftime('%H', CreatedDate), .N , hour]

q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>% 
    group_by(hour) %>% summarise(Complaints.per.Hour = n())

# Collect the data into memory
q <- collect(q)

plt <- ggplot(na.omit(q), aes(hour, Complaints.per.Hour)) + 
        geom_bar(stat='identity') + theme_minimal()

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Filter noise complaints by hour (might be easier to use `LIKE` operator in raw SQL)

# dt[grepl(ComplaintType, 'Noise'), hour := strftime('%H', CreatedDate), .N , hour]

q <- data %>% filter(ComplaintType %in% c( "Noise",
                                           "Noise - Street/Sidewalk", 
                                           "Noise - Commercial", 
                                           "Noise - Vehicle", 
                                           "Noise - Park", 
                                           "Noise - House of Worship", 
                                           "Noise - Helicopter", 
                                           "Collection Truck Noise"))  %>%
    mutate(hour = strftime('%H', CreatedDate)) %>% 
    group_by(hour) %>% summarise(Complaints.per.Hour = n())

# Collect the data into memory
q <- collect(q)
# omit NA
plt <- ggplot(na.omit(q), aes(hour, Complaints.per.Hour)) + 
            geom_bar(stat='identity') + theme_minimal()

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Segregate complaints by hour

This can be written more succinctly than the plotly article using two fields in the GROUP_BY() statement.

# dt[grepl(ComplaintType, 'Noise'), hour := strftime('%H', CreatedDate), .N , hour]
# setkey(dt, Complaints.per.Hour)
# dt[, tail(.SD, 2), hour]

q <- data %>% 
    mutate(hour = strftime('%H', CreatedDate)) %>% 
    group_by(hour, ComplaintType) %>% summarise(Complaints.per.Hour = n())

# Collect the data into memory
q <- collect(q)
# omit NA
q <- na.omit(q)

# Grab the 2 most common complaints for that hour
# Top 6 is way too many colors
q_hr <-q %>%
      group_by(hour) %>%
      top_n(n = 2, wt = Complaints.per.Hour)

# Plot (something appears to be off with midnight in my dataset)
plt <- ggplot(q_hr[q_hr$hour != '12',], aes(hour, Complaints.per.Hour, fill = ComplaintType)) + 
            geom_bar(stat='identity') + theme_minimal()

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Aggregate Time Series

First, create a new column with timestamps rounded to the previous 15 minute interval

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]

q <- data %>% 
     mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%                     
     select(CreatedDate, interval)

head_(q, 10)

CreatedDate	interval
2015-11-04 02:13:04	2015-11-04 02:00:00
2015-11-04 02:12:05	2015-11-04 02:00:00
2015-11-04 02:11:46	2015-11-04 02:00:00
2015-11-04 02:11:02	2015-11-04 02:00:00
2015-11-04 02:10:45	2015-11-04 02:00:00
2015-11-04 02:09:07	2015-11-04 02:00:00
2015-11-04 02:05:47	2015-11-04 02:00:00
2015-11-04 02:03:43	2015-11-04 02:00:00
2015-11-04 02:03:29	2015-11-04 02:00:00
2015-11-04 02:02:17	2015-11-04 02:00:00

Then, GROUP BY that interval and COUNT(*)

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .N, interval]

q <- data %>% 
     mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%  
     group_by(interval) %>%
     summarise(Complaints.per.Interval = n()) %>% filter(!is.na(Complaints.per.Interval))

head_(q, 10)

interval	Complaints.per.Interval
NA	5447756
2003-01-03 01:00:00	1
2003-01-03 03:15:00	2
2003-01-03 06:30:00	2
2003-01-03 06:45:00	1
2003-01-03 07:30:00	1
2003-01-03 09:15:00	1
2003-01-03 09:30:00	1
2003-01-03 09:45:00	1
2003-01-03 11:00:00	1

# Pull data into memory
q <- collect(q %>% filter(strftime('%Y', interval) == '2003'))

Plot the results for 2003

# Convert to proper datetime object
q$interval <- strptime(q$interval, '%Y-%m-%d %H:%M')

plt <- ggplot(q, aes(x=interval, y=Complaints.per.Interval)) + 
            geom_bar(stat='identity') +
            theme_minimal() 

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Same plot by hours

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(86400, 'seconds')][, .N, interval]

q <- data %>% 
     mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 86400) * 86400, 'unixepoch')")) %>%  
     group_by(interval) %>%
     summarise(Complaints.per.Interval = n()) %>% filter(!is.na(Complaints.per.Interval))

# Collect into memory
q <- collect(q)
# Convert to proper datetime object
q$interval <- strptime(q$interval, '%Y-%m-%d %H:%M')

plt <- ggplot(q, aes(x=interval, y=Complaints.per.Interval)) + 
        geom_bar(stat='identity') +
        theme_minimal()

# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')

Big-ish Data Workflow in R

Ryan Kelly

April 11, 2015

Introduction

The Data Workflow

Setting up plotly in R

Importing the CSV in SQLite

On-Disk Analysis in R with dplyr

About dplyr

Preview the data

Select just a couple of columns

Limit the number of items retrieved

Filter rows with WHERE

Filter multiple values in a column with WHERE and IN

Find the unique values in a column with DISTINCT

Query value counts with COUNT(*) and GROUP BY

Order the results with ORDER and -

Heat / Hot Water is the most common complaint

What is the most common complaint in each city?

Yikes - let’s just plot the 10 most complained about cities

Perform case insensitive queries with GROUP BY with COLLATE NOCASE

Complaint type count by city

Part 2 Time Series Operations

Filter SQLite rows with timestamp strings: YYYY-MM-DD hh:mm:ss

Pull out the hour unit from timestamps with strftime

Count the number of complaints (rows) per hour with strftime , GROUP BY , and count(*) (n())

Filter noise complaints by hour (might be easier to use `LIKE` operator in raw SQL)

Segregate complaints by hour

Aggregate Time Series

First, create a new column with timestamps rounded to the previous 15 minute interval

Then, GROUP BY that interval and COUNT(*)

Plot the results for 2003

Same plot by hours

Contact me

twitter | email.ryan.kelly@gmail.com | github

Big-ish Data Workflow in R

Ryan Kelly

April 11, 2015

Introduction

The Data Workflow

Setting up plotly in R

Importing the CSV in SQLite

On-Disk Analysis in R with dplyr

About dplyr

Preview the data

Select just a couple of columns

Limit the number of items retrieved

Filter rows with WHERE

Filter multiple values in a column with WHERE and IN

Find the unique values in a column with DISTINCT

Query value counts with COUNT(*) and GROUP BY

Order the results with ORDER and -

Heat / Hot Water is the most common complaint

What is the most common complaint in each city?

Yikes - let’s just plot the 10 most complained about cities

Perform case insensitive queries with GROUP BY with COLLATE NOCASE

Complaint type count by city

Part 2 Time Series Operations

Filter SQLite rows with timestamp strings: YYYY-MM-DD hh:mm:ss

Pull out the hour unit from timestamps with strftime

Count the number of complaints (rows) per hour with strftime , GROUP BY , and count(*) (n())

Filter noise complaints by hour (might be easier to use LIKE operator in raw SQL)

Segregate complaints by hour

Aggregate Time Series

First, create a new column with timestamps rounded to the previous 15 minute interval

Then, GROUP BY that interval and COUNT(*)

Plot the results for 2003

Same plot by hours

Contact me

twitter | email.ryan.kelly@gmail.com | github

Filter noise complaints by hour (might be easier to use `LIKE` operator in raw SQL)