Introduction

R loads data in memory, which limits how much data can be analysed on a single machine. When working with significantly large data sets, this can be a serious limitation. There are a few possible approaches to dealing with large volumes of data in R, such as using a Spark cluster or a compatible database service that R can connect to for data manipulation and analysis. This vignette demonstrates the use of Google BigQuery to store and analyse data through remote data frames, without the need to load the data in memory.

Google BigQuery is a serverless, no-ops cloud data warehouse designed to store, process and analyse petabyte-scale datasets. It scales its compute seamlessly without any operational overhead and lets you capture and derive meaningful insights from batch and streaming data in real time.

For this use case, the bigrquery, dplyr and DBI packages are used, which provide an abstraction over the underlying BigQuery REST API to interact with the data. bigrquery and DBI provide a low-level wrapper over the BigQuery APIs, whereas dplyr provides a higher level of abstraction and lets BigQuery tables be treated as data frames.

library(bigrquery) # R Interface to Google BigQuery API  
library(dplyr) # Grammar for data manipulation  
library(DBI) # Interface definition to connect to databases 
library(ggplot2) # Data Viz package

In order to use BigQuery, sign up for a Google Cloud account and create a Google Cloud Platform project. Refer to the steps in the quickstart guide - https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui - for more information on how to get started with BigQuery.

Set up the connection

The first step is to create a connection to the BigQuery project and dataset. This is done using the DBI and bigrquery packages, passing the project ID and dataset ID as parameters.

projectid <- 'uts-mdsi'
datasetid <- 'stds_assignment'
bq_conn <- dbConnect(bigquery(),
                     project = projectid,
                     dataset = datasetid,
                     use_legacy_sql = FALSE)
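
Depending on how the environment is set up, bigrquery will prompt for authentication the first time it talks to BigQuery. As a minimal sketch, authentication can also be done explicitly, assuming a bigrquery version that provides bq_auth() and a service-account key file (the file name here is hypothetical):

# Assumption: bigrquery >= 1.2 (gargle-based auth); the key file path is hypothetical
bigrquery::bq_auth(path = "my-service-account-key.json")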

Data Set

For this use case, we loaded the NSW crime data as tables in BigQuery to analyse the monthly incidents recorded by police from 1995 to 2018. For information on creating tables and loading data, please refer to the Google BigQuery how-to guides - https://cloud.google.com/bigquery/docs/loading-data
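
A local data frame can also be written to a BigQuery table directly from R through the DBI interface. A minimal sketch, assuming a hypothetical local data frame crime_df and a hypothetical destination table name:

# Assumption: crime_df is a hypothetical local data frame holding the crime extract
dbWriteTable(bq_conn, "crime_by_suburb_copy", crime_df)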

To view the list of tables in the BigQuery dataset, use the dbListTables function with the connection object as a parameter.

bigrquery::dbListTables(bq_conn) # List all the tables in BigQuery data set
## [1] "crime_by_postcode" "crime_by_suburb"

The tables can be accessed with the dplyr::tbl() function, passing the BigQuery connection and the table name as parameters.

crime_by_pcode <- dplyr::tbl(bq_conn, 
                              "crime_by_postcode") # connects to a table

Notice that crime_by_pcode is of class tbl_sql and not a tibble. This is because tbl() creates a direct reference to the table in BigQuery but does not bring the data in-memory.

class(crime_by_pcode)
## [1] "tbl_BigQueryConnection" "tbl_dbi"               
## [3] "tbl_sql"                "tbl_lazy"              
## [5] "tbl"

Query data using dplyr

dplyr verbs can be used to query or manipulate the data directly on BigQuery tables. Evaluation is lazy: the data is not pulled in-memory unless explicitly requested with the collect() function.

top_10_offence_recorded_2018 <- crime_by_pcode %>%
  group_by(Offence_category) %>%
  summarise(offence = sum(Jan_2018 + Feb_2018 + Mar_2018 + Apr_2018 +
                            May_2018 + Jun_2018 + Jul_2018 + Aug_2018 +
                            Sep_2018 + Oct_2018 + Nov_2018 + Dec_2018,
                          na.rm = TRUE)) %>%
  arrange(desc(offence)) %>%
  top_n(10)

top_10_offence_recorded_2018
## # Source:     lazy query [?? x 2]
## # Database:   BigQueryConnection
## # Ordered by: desc(offence)
##    Offence_category                          offence
##    <chr>                                       <int>
##  1 Theft                                      222331
##  2 Transport regulatory offences              114388
##  3 Against justice procedures                  65484
##  4 Assault                                     62336
##  5 Malicious damage to property                58358
##  6 Drug offences                               46403
##  7 Intimidation, stalking and harassment       31896
##  8 Disorderly conduct                          19663
##  9 Other offences                              14548
## 10 Prohibited and regulated weapons offences   13986
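
Because top_10_offence_recorded_2018 is still a lazy query, the SQL that dplyr generates for BigQuery can be inspected without pulling any data, for example with show_query() (a sketch, output omitted):

show_query(top_10_offence_recorded_2018) # Print the SQL translated by dplyr/dbplyr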

BigQuery data can be easily plotted with the ggplot2 package by passing the remote BigQuery object straight to ggplot(); the data is retrieved when the plot is built.

ggplot(top_10_offence_recorded_2018, aes(x = Offence_category, y = offence)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(y = "No of recorded offences", x = "Offence Category") +
  geom_text(aes(label = offence), size = 3)

Notice that the use of collect() in the next example creates an in-memory data frame. The in-memory data frame can then be uploaded as a new BigQuery table by calling bq_table_upload(), as sketched after the example.

top_10_postcode_crime_2018 <- crime_by_pcode %>%
  group_by(Postcode) %>%
  summarise(offence = sum(Jan_2018 + Feb_2018 + Mar_2018 + Apr_2018 +
                            May_2018 + Jun_2018 + Jul_2018 + Aug_2018 +
                            Sep_2018 + Oct_2018 + Nov_2018 + Dec_2018,
                          na.rm = TRUE)) %>%
  arrange(desc(offence)) %>%
  top_n(10) %>%
  collect()

top_10_postcode_crime_2018
## # A tibble: 10 x 2
##    Postcode offence
##       <int>   <int>
##  1     2000   20407
##  2     2170   18927
##  3     2770   15448
##  4     2148   14432
##  5     2560   12012
##  6     2150   10592
##  7     2750   10227
##  8     2250    9836
##  9     2500    8768
## 10     2760    8062
class(top_10_postcode_crime_2018)
## [1] "tbl_df"     "tbl"        "data.frame"

Query data using BigQuery Standard SQL

SQL queries can be submitted to BigQuery from R using the bq_project_query function, which runs the query and returns a bq_table object. The query results can then be retrieved as a tibble by passing that object to the bq_table_download function.

sql_query <- "SELECT Offence_category,
                     sum(Dec_2018) Dec_2018,
                     sum(Nov_2018) Nov_2018,
                     sum(Oct_2018) Oct_2018
              FROM `uts-mdsi.stds_assignment.crime_by_postcode`
              GROUP BY Offence_category
              ORDER BY Dec_2018 DESC
              LIMIT 10;"

offence_qtr <- bq_project_query(projectid, sql_query)

bq_table_download(offence_qtr)
## # A tibble: 10 x 4
##    Offence_category                      Dec_2018 Nov_2018 Oct_2018
##    <chr>                                    <int>    <int>    <int>
##  1 Theft                                    18383    19056    19144
##  2 Transport regulatory offences             8489     8791     8493
##  3 Against justice procedures                6282     5478     5487
##  4 Assault                                   6229     5411     5140
##  5 Malicious damage to property              5038     4933     5174
##  6 Drug offences                             4260     3928     3165
##  7 Intimidation, stalking and harassment     2946     2890     2787
##  8 Disorderly conduct                        1777     1671     1597
##  9 Other offences                            1380     1185     1084
## 10 Liquor offences                           1226      933      832
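
The bq_table object returned by bq_project_query can also be inspected without downloading any rows (a sketch, output omitted):

bq_table_nrow(offence_qtr)   # Number of rows in the query result
bq_table_fields(offence_qtr) # Column names and types of the result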

Additional Information

The dplyr package provides a higher-level abstraction over the Google REST APIs and makes it convenient to use verbs to manipulate data. If you have a good understanding of REST APIs and want to do something that is not supported by dplyr, then the bigrquery and/or DBI packages would be most appropriate. To learn more about the functions in the bigrquery package, please refer to https://www.rdocumentation.org/packages/bigrquery
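
For example, raw SQL can be run through the DBI interface on the connection created earlier, returning the result as an in-memory data frame. A minimal sketch, reusing bq_conn and the crime_by_postcode table from the examples above:

# Run raw SQL via DBI; the dataset set on bq_conn resolves the unqualified table name
offences_dec <- dbGetQuery(bq_conn,
                           "SELECT Offence_category, sum(Dec_2018) AS Dec_2018
                            FROM crime_by_postcode
                            GROUP BY Offence_category")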

References

NSW Bureau of Crime Statistics and Research n.d., Datasets, viewed 29 March 2019, https://www.bocsar.nsw.gov.au:443/Pages/bocsar_datasets/Datasets-.aspx.

bigrquery package | R Documentation n.d., viewed 29 March 2019, https://www.rdocumentation.org/packages/bigrquery/versions/1.1.0.