Step 1: Establish a connection

The first step to import data from a SQL database is creating a connection to it. You need different packages depending on the database you want to connect to. All these package do this in a uniform way, as specified in the DBI package.

dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().

If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.

##install.packages("RMySQL")
# Load the DBI package
library(DBI)
## Warning: package 'DBI' was built under R version 3.2.5
# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")
  
# Print out con
con
## <MySQLConnection:0,0>

Step 2: List the database tables

After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. This function requires the connection object as an input, and outputs a character vector with the table names.

# Build a vector of table names: tables
tables <- dbListTables(con)

# Display structure of tables
str(tables)
##  chr [1:4] "comments" "employees" "tweats" "users"

Step 3: Import data from a table

As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!

First, you have to import the Tweater data into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is an standard R data frame.

# Import the users table from tweater: users
users <- dbReadTable(con, "users")

# Print users
users
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja
# Import and print the tweats table from tweater: tweats
tweats <- dbReadTable(con, "tweats")
tweats
##   id user_id
## 1 75       3
## 2 88       4
## 3 77       6
## 4 87       5
## 5 49       1
## 6 24       7
##                                                                  post
## 1                                       break egg. bake egg. eat egg.
## 2                           wash strawberries. add ice. blend. enjoy.
## 3                       2 slices of bread. add cheese. grill. heaven.
## 4               open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6                              just eat an apple. simply and healthy.
##         date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
# Import and print the comments table from tweater: comments
comments <- dbReadTable(con, "comments")
comments
##      id tweat_id user_id            message
## 1  1022       87       7              nice!
## 2  1000       77       7             great!
## 3  1011       49       5            love it
## 4  1012       87       1   awesome! thanks!
## 5  1010       88       6              yuck!
## 6  1026       77       4      not my thing!
## 7  1004       49       1  this is fabulous!
## 8  1030       75       6           so easy!
## 9  1025       88       2             oh yes
## 10 1007       49       3           serious?
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day

How do the tables relate?

The connection to the MySQL database con has already been created for you. The three data frames (users, tweats and comments) that you created in the previous exercises are also available.

If you have a closer look at these tables, you’ll see that the tweats data frame, for example, contains a column user_id. The ids in the column refer to the users that have posted the tweat. Similarly, the comments data frame contains both an user_id and a tweat_id column. It specifies which user posted a comment on which tweat.

With this new knowledge, can you tell who posted the tweat on which somebody commented “awesome! thanks!” (comment 1012)?

  • The user with user_id 5, so Oliver.

Your very first SQL query

In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.

dbGetQuery() is what you need. As usual, you first pass it the connection object. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals “male”:

# Import post column of tweats where date is higher than "2015-09-21": latest
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")

# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")

# Print latest and elisabeth
latest 
##                                                                  post
## 1               open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3                              just eat an apple. simply and healthy.
elisabeth
##   tweat_id
## 1       87
## 2       49
## 3       77
## 4       77

More advanced SQL queries

The SQL query example given in the previous example, was the following:

## SELECT age 
##  FROM people 
##    WHERE gender = "male"

Suppose now that the people table also contains a country and married column you want to import. This time, you want to find out the age and country of married males. The query can be extended as follows, provided that there is a married column that’s 1 when the person in question is married:

## SELECT age, country 
##  FROM people 
##    WHERE gender = "male" AND married = 1
# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")

# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5")

# Print specific and short
specific
##   message
## 1  great!
short
##   id name
## 1  2 mike
## 2  3 thea
## 3  6 kate

Join the query madness!

Of course, SQL does not stop with the the three keywords SELECT, FROM and WHERE. Another very often used keyword is JOIN, and more specifically INNER JOIN. Take this call for example:

## SELECT name, post 
##  FROM users INNER JOIN tweats on users.id = user_id
##    WHERE date > "2015-09-19"

Here, the users table has been joined with the tweats table. This is possible because the id column in the users corresponds to the user_id column in the tweats table. Also notice how name, from the users table, and post and date, from the tweats table, can be referenced to without problems.

  • Can you predict the outcome of the following query?
specific2 <- dbGetQuery(con, "SELECT post, message FROM tweats INNER JOIN comments on tweats.id = tweat_id WHERE tweat_id = 77")
specific2 
##                                            post            message
## 1 2 slices of bread. add cheese. grill. heaven.             great!
## 2 2 slices of bread. add cheese. grill. heaven.      not my thing!
## 3 2 slices of bread. add cheese. grill. heaven. couldn't be better
## 4 2 slices of bread. add cheese. grill. heaven.       saved my day

Send - Fetch - Clear

In the previous exercises, the dbGetQuery() function was used. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

  • Sending the specified query with dbSendQuery();
  • Fetching the result of executing the query on the database with dbFetch();
  • Clearing the result with dbClearResult(). ### Let’s not choose use dbGetQuery() for once and use the above steps. This is tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")

# Use dbFetch() twice
dbFetch(res, n=2)
##     id tweat_id user_id message
## 1 1022       87       7   nice!
## 2 1000       77       7  great!
dbFetch(res)
##     id tweat_id user_id  message
## 1 1011       49       5  love it
## 2 1010       88       6    yuck!
## 3 1030       75       6 so easy!
# Clear res
dbClearResult(res)
## [1] TRUE

Be polite and …

Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections if you’re exaggerating, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.

# Create the data frame  long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) >40")

# Print long_tweats
print(long_tweats)
##                                                                  post
## 1                           wash strawberries. add ice. blend. enjoy.
## 2                       2 slices of bread. add cheese. grill. heaven.
## 3               open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
##         date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE