dbplyr - Outsource Your Data With the Tidyverse
dbplyr is a great tool for offloading large amounts of data from RStudio onto a SQL server. dbplyr is a good choice when:
- You have a data-heavy environment that is slowing down your local machine
- Your knits are taking forever
- You have many files that access the same data and you don’t want to reload from a csv every time
- You are working in a group
Set Up
dbplyr works with a SQL server to retrieve data and make it available as a dataframe on your local machine. dbplyr handles the SQL queries on the back end, directly interfacing with the server and returning an object you can manipulate further using standard Tidyverse tools.
I will show a simple way to create a new database and populate it with data before demonstrating how to use dbplyr.
You will need another package to access the SQL server of your choice. dbplyr works very well with standard, well-known SQL server packages. I will be using RMySQL and Google Cloud SQL for this example. SQLite is very popular for demonstration purposes, but is not as practical in a real-world scenario.
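For reference, here are the packages this walkthrough leans on (adjust to match your own setup; the driver package will differ if you use a server other than MySQL):
# Packages used throughout this post
library(dplyr)     # filter(), select(), group_by(), summarize()
library(dbplyr)    # translates dplyr verbs into SQL
library(RMySQL)    # MySQL driver plus DBI methods like dbConnect()
library(RCurl)     # getURL() for pulling the raw csv
library(DT)        # interactive html tables
library(stringr)   # str_replace_all() in the cleaning step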
Reproducibility: the set-up code for the SQL database is shown as markdown (not evaluated) because I only need to run it once.
Create a new database on your server
Before we can get to dbplyr, we need data and a SQL instance to tap into. Below is a very easy setup using the RMySQL package. You only need to run the following code chunk once, and the DB will be there as long as you keep it on your SQL server. One of the many wonderful things about storing and accessing data this way is that you only need to create one cloud instance - R and RMySQL can handle the rest, which is far superior to using MySQL Workbench or some other software on your local machine.
# Create a connection to your SQL server
conn <-
  dbConnect(
    MySQL(),
    username = "root",
    # For rmd on github use: rstudioapi::askForPassword("Database password")
    # For knitting use: params$pwd
    password = rstudioapi::askForPassword("Database password"),
    host = '34.68.193.229'
  )
# Create a DB using a SQL query.
dbSendQuery(conn, "CREATE DATABASE Congress_v_Trump;")
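As an optional sanity check, you can ask the server to list its databases and confirm the new one exists:
# Optional check - "Congress_v_Trump" should appear in the result
dbGetQuery(conn, "SHOW DATABASES;")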
Next, let the server know you intend to use the DB you created. If you are starting a new session after the database already exists, establish a connection and identify the database by name.
conn <-
  dbConnect(
    MySQL(),
    username = "root",
    # For rmd on github use: rstudioapi::askForPassword("Database password")
    # For knitting use: params$pwd
    password = rstudioapi::askForPassword("Database password"),
    host = '34.68.193.229',
    dbname = "Congress_v_Trump"
  )
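For completeness, here is one way the params$pwd approach from the comments could be wired up. This is a minimal sketch, and the param name pwd is my placeholder:
# In the YAML header of the .Rmd:
# params:
#   pwd: ""
# Knit with rmarkdown::render("your_post.Rmd", params = list(pwd = "<password>"))
conn <-
  dbConnect(
    MySQL(),
    username = "root",
    password = params$pwd,   # supplied at knit time, never hard-coded
    host = '34.68.193.229',
    dbname = "Congress_v_Trump"
  )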
The data is stored in a .csv file; we can load the .csv as a dataframe and use RMySQL to write that dataframe to our cloud database. You only need to do this once - the data will stay on your SQL server as long as you want.
# Fetch the raw csv from GitHub and read it into a dataframe
m <- getURL(
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/averages.csv")
df <- read.csv(text = m, header = T, stringsAsFactors = F)
# Write the dataframe to the cloud database (run once, then comment out)
# dbWriteTable(conn = conn, name = "Congress_v_Trump", value = df)
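To confirm the write succeeded (optional), list the tables on the connection:
# Optional check - the new table should appear here
dbListTables(conn)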
This data is now on our SQL server and ready to be used.
dbplyr
If you have used any dplyr functions before, you’ll be very happy to know that almost everything is the same in dbplyr! I’ll show a few common use cases as well as some places where dbplyr is different from dplyr.
tbl()
First, you need to let dbplyr know that you are using a database by pointing to a connection and a table. It’s one line - give tbl() the connection and the name of the table. If you use an html display for your data (DT in this case), you’ll have to cast the variable assigned by tbl() explicitly with as.data.frame().
# Point df at the remote table; queries on df now run on the server
df <- tbl(conn, "Congress_v_Trump")
DT::datatable(as.data.frame(df),
              extensions = c('FixedColumns', "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE,
                             pageLength = 5))
Now we are ready to use dbplyr to query the database!
Let’s start with a simple SQL-style query as an example. We will find a given individual and compute their average agreement with the president (agree_pct) and their average number of votes in alignment with the president (net_trump_vote).
df %>%
  filter(last_name == "Sanders") %>%
  select(state, party, agree_pct, net_trump_vote) %>%
  group_by(state) %>%
  summarize(avg_agree = mean(agree_pct),
            avg_net_trump_vote = mean(net_trump_vote))
## # A tibble: 1 x 3
## state avg_agree avg_net_trump_vote
## <chr> <dbl> <dbl>
## 1 VT 0.112 -26.4
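Because dbplyr evaluates lazily, the pipeline above is translated into SQL and executed on the server. If you’re curious what SQL gets generated, pipe the same query into show_query() (the exact output depends on your backend):
df %>%
  filter(last_name == "Sanders") %>%
  select(state, party, agree_pct, net_trump_vote) %>%
  group_by(state) %>%
  summarize(avg_agree = mean(agree_pct),
            avg_net_trump_vote = mean(net_trump_vote)) %>%
  show_query()   # prints the SQL dbplyr sends to the server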
Let’s look at another example: the same agreement values for all Democrats. I’m wrapping this in an html display for readability.
DT::datatable(as.data.frame(
  # dbplyr query
  df %>%
    filter(party == "D" | party == "Democrat") %>%
    select(state, last_name, agree_pct, net_trump_vote) %>%
    group_by(last_name, state) %>%
    summarize(max_agree = max(agree_pct, na.rm = T),
              max_net = max(net_trump_vote, na.rm = T))),
  # display parameters
  extensions = c('FixedColumns', "FixedHeader"),
  options = list(scrollX = TRUE,
                 paging = TRUE,
                 fixedHeader = TRUE,
                 pageLength = 7))
## Warning in max(agree_pct, na.rm = T): no non-missing arguments to max; returning
## -Inf
## Warning in max(net_trump_vote, na.rm = T): no non-missing arguments to max;
## returning -Inf
These warnings indicate that some groups have no non-missing values for agree_pct or net_trump_vote, so max() has nothing to work with and returns -Inf for those groups.
If we want to do more data cleaning and have full use of tidyr or dplyr functions, we need to use dbplyr to pull down a subset of the database and store it in local memory. Note: use the collect() function to save results to local memory. collect() is also required before slice() can be used to limit results (a quick example follows the display below).
df_subset <- df %>%
  select(last_name, state, party, agree_pct, net_trump_vote) %>%
  collect()
DT::datatable(df_subset,
              extensions = c('FixedColumns', "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE,
                             pageLength = 5))
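To make the slice() point concrete, here is a minimal sketch: slice() is not translated to SQL, so collect() first and then slice locally.
# slice() has no SQL translation, so bring the data down first
df %>%
  select(last_name, state, agree_pct) %>%
  collect() %>%      # pull the subset into local memory
  slice(1:5)         # now slice works as usual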
Once the data is in local memory, we can use functions like mutate() and other tidyverse packages like stringr.
DT::datatable(df_subset %>%
                mutate(party = str_replace_all(party, "^R$", "Republican")) %>%
                mutate(party = str_replace_all(party, "^D$", "Democrat")) %>%
                mutate(party = str_replace_all(party, "^I$", "Independent")),
              extensions = c('FixedColumns', "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE,
                             pageLength = 5))
dbplyr is a great way to explore and refine data, just as you would in SQL, before loading it into R for further processing and analysis.