This is an R version of a recent article Big data analytics with Pandas and SQLite by plotly. I’d argue this is ‘medium’ data, however, I understand the need to generate a nice headline. For the R folks, I also expand on alternative packages and methods that may improve the workflow.
This document is not meant to make any claims that R is better or faster than Python for data analysis, I myself use both languages daily. The plotly article simply provides an opportunity to compare the two languages.
The data from this article is updated daily, therefore my version of the file is quite a bit larger at 4.63 GB. Therefore, expect the charts to return different results compared to the article.
Here are the details quoted from the article…This notebook explores a 3.9GB CSV file containing NYC’s 311 complaints since 2003. It’s the most popular data set in NYC’s open data portal.
This notebook is a primer on out-of-memory data analysis with
The dataset is too large to load into a Pandas dataframe. So, instead we’ll perform out-of-memory aggregations with SQLite and load the result directly into a dataframe with Panda’s iotools . It’s pretty easy to stream a CSV into SQLite and SQLite requires no setup. The SQL query language is pretty intuitive coming from a Pandas mindset.
Since we are using R, we will be substituting the python packages with the closest R counterpart, this is of course subjective.
Instead of IPython notebook I have published this document using knitr / Rmarkdown via RStudio
I will maintain the use of SQLite, and plotly for R & ggplot2. However, for time series charts, I highly recommend the R package dygraphs, which provides similar interactive plotting, without requiring an internet connection.
Where possible I will outline to opporunities to substitute different R packages that in my opinion may improve the workflow.
plotly for R is not yet available on CRAN. We can instead download it from the rOpenSci github repo.
install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
library(plotly)
You’ll need to create an account to connect to the plotly API. Or you could just stick with the default ggplot2 graphics.
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key
This defeats the purpose of this article, but R can load this data into memory easily. If your computer resources permit it, the data will be much faster to operate on in-memory compared to an SQL database on disk. 8GB of RAM would be plenty in this case. In the spirit of a true comparison, I will replicate the on-disk analysis approach using an SQLite database, however, I will show a few benchmark’s using the data in memory using data.table along the way.
While dplyr is capable of writing to databases, the data still must flow through R, which would probably be considered cheating in this case. The alternative is to simply to create a database and import the csv using the command line. Please create a pull request for this github repository if you have any other suggestions.
Here is the code I typed into the command line. Pretty easy right? This assumes you have sqlite3 installed and available on your PATH variable (hence accessible via terminal).
$ sqlite3 data.db # Create your database
$.databases # Show databases to make sure it works
$.mode csv
$.import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
$.quit
Let’s also load the data into memory so we can compare in memory operations along the way. Here is a crude benchmark of file I/O in R. readr a recently released alternative to read.csv should also eat through this data quickly.
library(readr)
# data.table, selecting a subset of columns
time_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv',
select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'),
showProgress = T))
# Default data.table
time_data.table_full <- system.time(fread('/users/ryankelly/NYC_data.csv',
showProgress = T))
# Default readr
time_readr <- system.time(read_csv('/users/ryankelly/NYC_data.csv'))
# Default base R (really slow - don't recommend running)
# time_base_r <- system.time(read.csv('/users/ryankelly/NYC_data.csv'))
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))
| user.self | sys.self | elapsed | user.child | sys.child | |
|---|---|---|---|---|---|
| time_data.table | 63.588 | 1.952 | 65.633 | 0 | 0 |
| time_data.table_full | 205.571 | 3.124 | 208.880 | 0 | 0 |
| time_readr | 277.720 | 5.018 | 283.029 | 0 | 0 |
I will be using data.table to read in the data. The fread function has the really nice ability to select the columns we want to read in, this speeds up the read quite a bit. The plotly article disregards all but the following columns from the dataset.
library(data.table)
dt <- fread('/users/ryankelly/NYC_data.csv',
select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'),
showProgress = F)
# Rename columns to remove spaces
setnames(dt, 'Created Date', 'CreatedDate')
setnames(dt, 'Closed Date', 'ClosedDate')
setnames(dt, 'Complaint Type', 'ComplaintType')
Switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), so dplyr allows you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function: that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst. - dplyr databases
dplyr is pretty fantastic because you can use the same syntax to:
In the event you need to run a more sophisticated query or if you prefer plain SQL, you can also simply pass in an SQL statement as a string, or generate your own sql wrappers. dplyr supports: sqlite, mysql, postgresql, and google’s bigquery. You will see that dplyr’s syntax is very similar to SQL.
For a more information about dplyr, see these two tutorials, which I borrow from heavily:
By default, dplyr queries will only pull the first 10 rows from the database, and never pulls data into R unless you ask for it directly.
library(dplyr) ## Will be used for pandas replacement
# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
## src: sqlite 3.8.6 [/users/ryankelly/data.db]
## tbls: NYC_data
# Connect to the table of interest, which I called NYC_data
data <- tbl(db, 'NYC_data')
# We can pass SQL directly here to select only the columns of interest
# I 'rename' the columns with spaces to be able to properly access them in dplyr
data <- tbl(db, sql('SELECT Agency, "Created Date" AS CreatedDate,
"Closed Date" AS ClosedDate, "Complaint Type" AS ComplaintType, Descriptor, City
FROM NYC_data'))
Once we have to do a more expensive query, the data become pretty slow to work with. This would be apparent in both Python or R. This is why you would rather operate on the data in-memory. The two best choices for data manipulation (besides base R) are:
Remember, the following analysis is completed using dplyr as a wrapper for SQL statements, operating on-disk. Use explain() to expose the SQL commands in your statements. You can simply use raw SQL strings if you prefer. I will also show how to run the same commands in data.table in the comments.
I am not going to benchmark every query, as there are many examples of the speed advantages of data.table. I prefer data.table for both the speed boost, and compact syntax. For a first look at data.table, take a look at this cheatsheet.
I make use of kable() throughout this document as a way to print pretty tables.
# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))
head_(data)
| Agency | CreatedDate | ClosedDate | ComplaintType | Descriptor | City |
|---|---|---|---|---|---|
| NYPD | 04/11/2015 02:13:04 AM | Noise - Street/Sidewalk | Loud Music/Party | BROOKLYN | |
| DFTA | 04/11/2015 02:12:05 AM | Senior Center Complaint | N/A | ELMHURST | |
| NYPD | 04/11/2015 02:11:46 AM | Noise - Commercial | Loud Music/Party | JAMAICA | |
| NYPD | 04/11/2015 02:11:02 AM | Noise - Street/Sidewalk | Loud Talking | BROOKLYN | |
| NYPD | 04/11/2015 02:10:45 AM | Noise - Street/Sidewalk | Loud Music/Party | NEW YORK |
# dt[, .(ComplaintType, Descriptor, Agency)]
q <- data %>% select(ComplaintType, Descriptor, Agency)
head_(q)
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Senior Center Complaint | N/A | DFTA |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
# dt[, .(ComplaintType, Descriptor, Agency)][1:10]
q <- data %>% select(ComplaintType, Descriptor, Agency)
head_(q, n = 10)
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Senior Center Complaint | N/A | DFTA |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| HPD Literature Request | The ABCs of Housing - Spanish | HPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Street Condition | Plate Condition - Noisy | DOT |
# dt[Agency == 'NYPD', .(ComplaintType, Descriptor, Agency)]
q <- data %>%
select(ComplaintType, Descriptor, Agency) %>%
filter(Agency == "NYPD")
head_(q)
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
# dt[Agency == 'NYPD' | Agency == 'DOB', .(ComplaintType, Descriptor, Agency)]
q <- data %>% select(ComplaintType, Descriptor, Agency) %>%
filter(Agency %in% c('DOB', 'NYPD'))
head_(q)
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
# dt[, unique(City)]
q <- data %>% select(City) %>% distinct()
head(q)
## City
## 1 BROOKLYN
## 2 ELMHURST
## 3 JAMAICA
## 4 NEW YORK
## 5
## 6 BAYSIDE
# dt[, .(No.Complaints = .N), Agency]
#setkey(dt, No.Complaints) # setkey index's the data
q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)
| Agency | No.Complaints |
|---|---|
| 3-1-1 | 22499 |
| ACS | 3 |
| AJC | 7 |
| ART | 3 |
| CAU | 8 |
# dt[, .(No.Complaints = .N), Agency]
#setkey(dt, No.Complaints) # setkey index's the data
q <- data %>% select(Agency) %>%
group_by(Agency) %>%
summarise(No.Complaints = n()) %>%
arrange(-No.Complaints)
# Pull the data out of memory to plot it
q <- collect(q)
# Convert to ordered factor to maintain order in plot
q$Agency <- factor(q$Agency, levels = q$Agency, ordered = T)
library(ggplot2)
# Plot top 50
plt <- ggplot(q[1:50,], aes(Agency, No.Complaints)) +
geom_bar(stat= 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
# dt[, .(No.Complaints = .N), ComplaintType]
#setkey(dt, No.Complaints) # setkey index's the data
q <- data %>% select(ComplaintType, Agency) %>%
group_by(ComplaintType) %>%
summarise(No.Complaints = n()) %>%
arrange(-No.Complaints)
# Pull the data out of memory to plot it
q <- collect(q)
# Convert to ordered factor to maintain order in plot
q$ComplaintType <- factor(q$ComplaintType, levels = q$ComplaintType, ordered = T)
# Plot the data (top 50)
plt <- ggplot(q[1:50,], aes(x = ComplaintType, y = No.Complaints)) +
geom_bar(stat= 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
How many cities are in the database?
# dt[, unique(City)]
q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)
## Number.of.Cities
## 1 1818
# dt[, (No.Complaints = .N), City]
#setkey(dt, No.Complaints)
q <- data %>% select(City) %>% group_by(City) %>%
summarise(No.Complaints = n()) %>%
arrange(-No.Complaints)
head_(q, 10)
| City | No.Complaints |
|---|---|
| BROOKLYN | 2671085 |
| NEW YORK | 1692514 |
| BRONX | 1624292 |
| 766378 | |
| STATEN ISLAND | 437395 |
| JAMAICA | 147133 |
| FLUSHING | 117669 |
| ASTORIA | 90570 |
| Jamaica | 67083 |
| RIDGEWOOD | 66411 |
UPPER to convert the CITY format.# dt[, CITY := toupper(City)][, (No.Complaints = .N), CITY]
#setkey(dt, No.Complaints)
# No clear way to do this using dplyr functions - default to regular SQL
q <- tbl(db, sql('SELECT UPPER(City) as "CITY", COUNT(*) as "No.Complaints"
FROM NYC_data
GROUP BY "CITY"
ORDER BY -"No.Complaints"
LIMIT 11'))
head_(q, 10)
| CITY | No.Complaints |
|---|---|
| BROOKLYN | 2671085 |
| NEW YORK | 1692514 |
| BRONX | 1624292 |
| 766378 | |
| STATEN ISLAND | 437395 |
| JAMAICA | 147133 |
| FLUSHING | 117669 |
| ASTORIA | 90570 |
| JAMAICA | 67083 |
| RIDGEWOOD | 66411 |
City and Complaint type accomplishes the same task. I reduce the number of items plotted here for display purposes.# dt[, CITY := toupper(City)][, (No.Complaints = .N), .(ComplaintType, CITY)]
#setkey(dt, No.Complaints)
# No clear way to do this using dplyr functions - default to regular SQL
q <- tbl(db, sql('SELECT "Complaint Type" as "ComplaintType", UPPER(City) as "CITY",
COUNT(*) as "No.Complaints"
FROM NYC_data
GROUP BY CITY, ComplaintType
ORDER BY -"No.Complaints"'))
# Select only list of cities used in ploty article
q_f <- filter(q, CITY %in% c(
'FAR ROCKAWAY',
'FLUSHING',
'JAMAICA',
'STATEN ISLAND',
'BRONX',
'NEW YORK',
'BROOKLYN'
))
# Pull the data out of memory to plot it
q_f <- collect(q_f)
# Select a cutoff for the most popular complaints to plot
# dt[, sum(No.Complaints), ComplaintType][, .(ComplaintType)]
top_complaints <- q_f%>% group_by(ComplaintType) %>%
summarise(n = sum(No.Complaints)) %>%
arrange(-n)
# Top 15 complaints
q_f <- filter(q_f, ComplaintType %in% top_complaints$ComplaintType[1:15])
# Plot result
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) +
geom_bar(stat = 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plt
# plotly cannot handle this type of plot currently.
Now let’s normalize these counts. This is super easy now that this data has been reduced into a dataframe.
# dt[, No.Complaints_normalized := No.Complaints / sum(No.Complaints)]
q_f <- q_f %>% group_by(CITY) %>%
mutate(Normalized.Complaints = round((No.Complaints / sum(No.Complaints))*100, 2))
# Plot result
plt <- ggplot(q_f, aes(ComplaintType, Normalized.Complaints, fill = CITY)) +
geom_bar(stat = 'identity', position = 'dodge') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle('Relative Number of Complaints by City')
plt
# plotly cannot handle this type of plot currently.
The data provided does not fit the standard date format for SQLite. We have a few options to remedy this:
To illustrate using R, I will use dplyr. This is not necessarily the most optimal approach. Here I select substrings of the date fields, with SUBSTR and concatenate them with ||.
data <- tbl(db, sql('SELECT Agency, "Complaint Type" AS ComplaintType, Descriptor, City,
SUBSTR("Created Date", 7, 4) || "-" ||
SUBSTR("Created Date", 4, 2) || "-" ||
SUBSTR("Created Date", 1, 2) || " " ||
SUBSTR("Created Date", 12, 2) || ":" ||
SUBSTR("Created Date", 15, 2) || ":" ||
SUBSTR("Created Date", 18, 2) as CreatedDate,
SUBSTR("Closed Date", 7, 4) || "-" ||
SUBSTR("Closed Date", 4, 2) || "-" ||
SUBSTR("Closed Date", 1, 2) || " " ||
SUBSTR("Closed Date", 12, 2) || ":" ||
SUBSTR("Closed Date", 15, 2) || ":" ||
SUBSTR("Closed Date", 18, 2) as ClosedDate
FROM NYC_data'))
# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00',
# .(ComplaintType, CreatedDate, City)]
q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>%
select(ComplaintType, CreatedDate, City)
head_(q)
| ComplaintType | CreatedDate | City |
|---|---|---|
| Noise - Street/Sidewalk | 2014-11-12 11:59:56 | BRONX |
| Taxi Complaint | 2014-11-12 11:59:40 | BROOKLYN |
| Noise - Commercial | 2014-11-12 11:58:53 | BROOKLYN |
| Noise - Commercial | 2014-11-12 11:58:26 | NEW YORK |
| Noise - Street/Sidewalk | 2014-11-12 11:58:14 | NEW YORK |
format methods# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
select(ComplaintType, CreatedDate, City, hour)
head_(q)
| ComplaintType | CreatedDate | City | hour |
|---|---|---|---|
| Noise - Street/Sidewalk | 2015-11-04 02:13:04 | BROOKLYN | 02 |
| Senior Center Complaint | 2015-11-04 02:12:05 | ELMHURST | 02 |
| Noise - Commercial | 2015-11-04 02:11:46 | JAMAICA | 02 |
| Noise - Street/Sidewalk | 2015-11-04 02:11:02 | BROOKLYN | 02 |
| Noise - Street/Sidewalk | 2015-11-04 02:10:45 | NEW YORK | 02 |
# dt[, hour := strftime('%H', CreatedDate), .N , hour]
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
group_by(hour) %>% summarise(Complaints.per.Hour = n())
# Collect the data into memory
q <- collect(q)
plt <- ggplot(na.omit(q), aes(hour, Complaints.per.Hour)) +
geom_bar(stat='identity') + theme_minimal()
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
LIKE operator in raw SQL)# dt[grepl(ComplaintType, 'Noise'), hour := strftime('%H', CreatedDate), .N , hour]
q <- data %>% filter(ComplaintType %in% c( "Noise",
"Noise - Street/Sidewalk",
"Noise - Commercial",
"Noise - Vehicle",
"Noise - Park",
"Noise - House of Worship",
"Noise - Helicopter",
"Collection Truck Noise")) %>%
mutate(hour = strftime('%H', CreatedDate)) %>%
group_by(hour) %>% summarise(Complaints.per.Hour = n())
# Collect the data into memory
q <- collect(q)
# omit NA
plt <- ggplot(na.omit(q), aes(hour, Complaints.per.Hour)) +
geom_bar(stat='identity') + theme_minimal()
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
GROUP_BY() statement.# dt[grepl(ComplaintType, 'Noise'), hour := strftime('%H', CreatedDate), .N , hour]
# setkey(dt, Complaints.per.Hour)
# dt[, tail(.SD, 2), hour]
q <- data %>%
mutate(hour = strftime('%H', CreatedDate)) %>%
group_by(hour, ComplaintType) %>% summarise(Complaints.per.Hour = n())
# Collect the data into memory
q <- collect(q)
# omit NA
q <- na.omit(q)
# Grab the 2 most common complaints for that hour
# Top 6 is way too many colors
q_hr <-q %>%
group_by(hour) %>%
top_n(n = 2, wt = Complaints.per.Hour)
# Plot (something appears to be off with midnight in my dataset)
plt <- ggplot(q_hr[q_hr$hour != '12',], aes(hour, Complaints.per.Hour, fill = ComplaintType)) +
geom_bar(stat='identity') + theme_minimal()
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
select(CreatedDate, interval)
head_(q, 10)
| CreatedDate | interval |
|---|---|
| 2015-11-04 02:13:04 | 2015-11-04 02:00:00 |
| 2015-11-04 02:12:05 | 2015-11-04 02:00:00 |
| 2015-11-04 02:11:46 | 2015-11-04 02:00:00 |
| 2015-11-04 02:11:02 | 2015-11-04 02:00:00 |
| 2015-11-04 02:10:45 | 2015-11-04 02:00:00 |
| 2015-11-04 02:09:07 | 2015-11-04 02:00:00 |
| 2015-11-04 02:05:47 | 2015-11-04 02:00:00 |
| 2015-11-04 02:03:43 | 2015-11-04 02:00:00 |
| 2015-11-04 02:03:29 | 2015-11-04 02:00:00 |
| 2015-11-04 02:02:17 | 2015-11-04 02:00:00 |
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .N, interval]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
group_by(interval) %>%
summarise(Complaints.per.Interval = n()) %>% filter(!is.na(Complaints.per.Interval))
head_(q, 10)
| interval | Complaints.per.Interval |
|---|---|
| NA | 5447756 |
| 2003-01-03 01:00:00 | 1 |
| 2003-01-03 03:15:00 | 2 |
| 2003-01-03 06:30:00 | 2 |
| 2003-01-03 06:45:00 | 1 |
| 2003-01-03 07:30:00 | 1 |
| 2003-01-03 09:15:00 | 1 |
| 2003-01-03 09:30:00 | 1 |
| 2003-01-03 09:45:00 | 1 |
| 2003-01-03 11:00:00 | 1 |
# Pull data into memory
q <- collect(q %>% filter(strftime('%Y', interval) == '2003'))
# Convert to proper datetime object
q$interval <- strptime(q$interval, '%Y-%m-%d %H:%M')
plt <- ggplot(q, aes(x=interval, y=Complaints.per.Interval)) +
geom_bar(stat='identity') +
theme_minimal()
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(86400, 'seconds')][, .N, interval]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 86400) * 86400, 'unixepoch')")) %>%
group_by(interval) %>%
summarise(Complaints.per.Interval = n()) %>% filter(!is.na(Complaints.per.Interval))
# Collect into memory
q <- collect(q)
# Convert to proper datetime object
q$interval <- strptime(q$interval, '%Y-%m-%d %H:%M')
plt <- ggplot(q, aes(x=interval, y=Complaints.per.Interval)) +
geom_bar(stat='identity') +
theme_minimal()
# Convert to plotly object
py <- plotly()
py$ggplotly(plt, session='knitr')