Step 1: Establish a connection
The first step to import data from a SQL database is creating a connection to it. You need different packages depending on the database you want to connect to. All these package do this in a uniform way, as specified in the DBI package.
dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().
If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.
##install.packages("RMySQL")
# Load the DBI package
library(DBI)
## Warning: package 'DBI' was built under R version 3.2.5
# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Print out con
con
## <MySQLConnection:0,0>
Step 2: List the database tables
Step 3: Import data from a table
As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!
First, you have to import the Tweater data into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is an standard R data frame.
# Import the users table from tweater: users
users <- dbReadTable(con, "users")
# Print users
users
## id name login
## 1 1 elisabeth elismith
## 2 2 mike mikey
## 3 3 thea teatime
## 4 4 thomas tomatotom
## 5 5 oliver olivander
## 6 6 kate katebenn
## 7 7 anjali lianja
# Import and print the tweats table from tweater: tweats
tweats <- dbReadTable(con, "tweats")
tweats
## id user_id
## 1 75 3
## 2 88 4
## 3 77 6
## 4 87 5
## 5 49 1
## 6 24 7
## post
## 1 break egg. bake egg. eat egg.
## 2 wash strawberries. add ice. blend. enjoy.
## 3 2 slices of bread. add cheese. grill. heaven.
## 4 open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6 just eat an apple. simply and healthy.
## date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
# Import and print the comments table from tweater: comments
comments <- dbReadTable(con, "comments")
comments
## id tweat_id user_id message
## 1 1022 87 7 nice!
## 2 1000 77 7 great!
## 3 1011 49 5 love it
## 4 1012 87 1 awesome! thanks!
## 5 1010 88 6 yuck!
## 6 1026 77 4 not my thing!
## 7 1004 49 1 this is fabulous!
## 8 1030 75 6 so easy!
## 9 1025 88 2 oh yes
## 10 1007 49 3 serious?
## 11 1020 77 1 couldn't be better
## 12 1014 77 1 saved my day
How do the tables relate?
The connection to the MySQL database con has already been created for you. The three data frames (users, tweats and comments) that you created in the previous exercises are also available.
If you have a closer look at these tables, you’ll see that the tweats data frame, for example, contains a column user_id. The ids in the column refer to the users that have posted the tweat. Similarly, the comments data frame contains both an user_id and a tweat_id column. It specifies which user posted a comment on which tweat.
With this new knowledge, can you tell who posted the tweat on which somebody commented “awesome! thanks!” (comment 1012)?
- The user with user_id 5, so Oliver.
Your very first SQL query
In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.
More advanced SQL queries
The SQL query example given in the previous example, was the following:
## SELECT age
## FROM people
## WHERE gender = "male"
Suppose now that the people table also contains a country and married column you want to import. This time, you want to find out the age and country of married males. The query can be extended as follows, provided that there is a married column that’s 1 when the person in question is married:
## SELECT age, country
## FROM people
## WHERE gender = "male" AND married = 1
# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")
# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5")
# Print specific and short
specific
## message
## 1 great!
short
## id name
## 1 2 mike
## 2 3 thea
## 3 6 kate
Join the query madness!
Of course, SQL does not stop with the the three keywords SELECT, FROM and WHERE. Another very often used keyword is JOIN, and more specifically INNER JOIN. Take this call for example:
## SELECT name, post
## FROM users INNER JOIN tweats on users.id = user_id
## WHERE date > "2015-09-19"
Here, the users table has been joined with the tweats table. This is possible because the id column in the users corresponds to the user_id column in the tweats table. Also notice how name, from the users table, and post and date, from the tweats table, can be referenced to without problems.
- Can you predict the outcome of the following query?
specific2 <- dbGetQuery(con, "SELECT post, message FROM tweats INNER JOIN comments on tweats.id = tweat_id WHERE tweat_id = 77")
specific2
## post message
## 1 2 slices of bread. add cheese. grill. heaven. great!
## 2 2 slices of bread. add cheese. grill. heaven. not my thing!
## 3 2 slices of bread. add cheese. grill. heaven. couldn't be better
## 4 2 slices of bread. add cheese. grill. heaven. saved my day
Be polite and …
Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections if you’re exaggerating, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.
# Create the data frame long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) >40")
# Print long_tweats
print(long_tweats)
## post
## 1 wash strawberries. add ice. blend. enjoy.
## 2 2 slices of bread. add cheese. grill. heaven.
## 3 open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE