R loads the data in-memory which can impose a limit to how much data can be analysed on a single machine. While dealing with significantly large data sets, this can be serious limitation. There are a few possible approaches to dealing with large volumes of data in R, such as using a Spark cluster or a compatible database service which R can connect to and perform data manipulation and analysis . This vignette demostrates the use of Google BigQuery to store and analyse data using remote data frames without the need to load it in-memory.
Google BigQuery is a serverless no-ops Cloud Datawarehouse designed to store, process and analyse petabyte size datasets. It can scale its compute seamlessly without any ops and it allows to capture and derive meaningful insights from batch and streaming data at real time
For this use case, “bigrquery”, “dplyr” and “DBI” packages are used which provides an abstraction over the underlying BigQuery REST API to interact with the data. bigrquery and DBI provide a low level wrapper over the BigQuery API’s whereas dplyr provides a higher level of abstraction and lets BigQuery tables be treated as data frames.
library(bigrquery) # R Interface to Google BigQuery API
library(dplyr) # Grammar for data manipulation
library(DBI) # Interface definition to connect to databases
library(ggplot2) # Data Viz package
In order to use BigQuery, sign up for a Google cloud account and create a Google Cloud Platform project. Refer to the steps in the quickstart guide - https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui for more information on how to get stared on BigQuery
The first step is to create a connection to the BigQuery project and the data set. This is done by using the DBI and bigrquery packages and passing the project id and data set id as parameters.
projectid<-'uts-mdsi'
datasetid<-'stds_assignment'
bq_conn <- dbConnect(bigquery(),
project = projectid,
dataset = datasetid,
use_legacy_sql = FALSE
)
For this use case, we loaded the NSW crime data as tables in BigQuery to analyse the monthly incidents recorded by police from 1995 to 2018. For information on creating tables and loading data, please refer to the Google BigQuery how tos - https://cloud.google.com/bigquery/docs/loading-data
To view the list of tables in BigQuery dataset, use the dbListTables function with the connection details object as a parameter
bigrquery::dbListTables(bq_conn) # List all the tables in BigQuery data set
## [1] "crime_by_postcode" "crime_by_suburb"
The tables can be accessed by using the dplyr::tbl() function and passing the table name and BigQuery connection details as parameters
crime_by_pcode <- dplyr::tbl(bq_conn,
"crime_by_postcode") # connects to a table
Notice that the “crime_by_pcode” is of class tbl_sql and not tibble. This is because tbl() creates a direct reference to the table in BigQuery but does not bring the data in-memory.
class(crime_by_pcode)
## [1] "tbl_BigQueryConnection" "tbl_dbi"
## [3] "tbl_sql" "tbl_lazy"
## [5] "tbl"
Dplyr verbs can be used to query or manipulate the data directly on BigQuery tables. It does a lazy evaluation and doesn’t pull the data in-memory unless requested using the collect() function
top_10_offence_recorded_2018 <-
crime_by_pcode %>% group_by(Offence_category) %>% summarise(
offence = sum(
Jan_2018 + Feb_2018 + Mar_2018 + Apr_2018 + May_2018 + Jun_2018 + Jul_2018 +
Aug_2018 + Sep_2018 + Oct_2018 + Nov_2018 + Dec_2018,
na.rm = TRUE
)
) %>% arrange(desc(offence)) %>% top_n(10)
top_10_offence_recorded_2018
## # Source: lazy query [?? x 2]
## # Database: BigQueryConnection
## # Ordered by: desc(offence)
## Offence_category offence
## <chr> <int>
## 1 Theft 222331
## 2 Transport regulatory offences 114388
## 3 Against justice procedures 65484
## 4 Assault 62336
## 5 Malicious damage to property 58358
## 6 Drug offences 46403
## 7 Intimidation, stalking and harassment 31896
## 8 Disorderly conduct 19663
## 9 Other offences 14548
## 10 Prohibited and regulated weapons offences 13986
BigQuery data can be easily plotted using ggplot2 package by referring to the remote BigQuery object class
ggplot(top_10_offence_recorded_2018,
aes(x = Offence_category, y = offence)) + geom_bar(stat = "identity") + coord_flip() + labs(y =
"No of recorded offences", x = "Offence Category") + geom_text(aes(label = offence),size = 3)
Notice the use of collect() creates an in-memory data frame. The in-memory data frame can be uploaded as a new BigQuery table by calling the bq_table_upload()
top_10_postcode_crime_2018 <-
crime_by_pcode %>% group_by(Postcode) %>% summarise(
offence = sum(
Jan_2018 + Feb_2018 + Mar_2018 + Apr_2018 + May_2018 + Jun_2018 + Jul_2018 +
Aug_2018 + Sep_2018 + Oct_2018 + Nov_2018 + Dec_2018,
na.rm = TRUE
)
) %>% arrange(desc(offence)) %>% top_n(10) %>% collect()
top_10_postcode_crime_2018
## # A tibble: 10 x 2
## Postcode offence
## <int> <int>
## 1 2000 20407
## 2 2170 18927
## 3 2770 15448
## 4 2148 14432
## 5 2560 12012
## 6 2150 10592
## 7 2750 10227
## 8 2250 9836
## 9 2500 8768
## 10 2760 8062
class(top_10_postcode_crime_2018)
## [1] "tbl_df" "tbl" "data.frame"
SQL queries can be submitted to BigQuery from R using the bq_perform_query function. bq_project_query creates an object of class bq_query and the query results can be viewed by passing the object to bq_table_download function.
sql_query <-
"SELECT Offence_category,sum(Dec_2018) Dec_2018,sum(Nov_2018) Nov_2018,sum(Oct_2018) Oct_2018 FROM `uts-mdsi.stds_assignment.crime_by_postcode`
group by Offence_category order by Dec_2018 desc limit 10;"
offence_qtr <- bq_project_query(projectid, sql_query)
bq_table_download(offence_qtr)
## # A tibble: 10 x 4
## Offence_category Dec_2018 Nov_2018 Oct_2018
## <chr> <int> <int> <int>
## 1 Theft 18383 19056 19144
## 2 Transport regulatory offences 8489 8791 8493
## 3 Against justice procedures 6282 5478 5487
## 4 Assault 6229 5411 5140
## 5 Malicious damage to property 5038 4933 5174
## 6 Drug offences 4260 3928 3165
## 7 Intimidation, stalking and harassment 2946 2890 2787
## 8 Disorderly conduct 1777 1671 1597
## 9 Other offences 1380 1185 1084
## 10 Liquor offences 1226 933 832
The dplyr package provides a higher level abstraction over the Google REST API’s and makes it convenient to use verbs to manipulate data. If you have a good understanding of REST API’s and want to do something that’s not supported by dplyr, then use of bigrquery and/or DBI packages would be most appropriate. To learn more about the functions in bigrquery package, please refer to https://www.rdocumentation.org/packages/bigrquery
Community Relations Division, freecall 1800 685 449 & Justice, N.D. of n.d., Datasets, viewed 29 March 2019, https://www.bocsar.nsw.gov.au:443/Pages/bocsar_datasets/Datasets-.aspx.
bigrquery package | R Documentation n.d., viewed 29 March 2019, https://www.rdocumentation.org/packages/bigrquery/versions/1.1.0.