Introduction

A database is a structured set of data. Terminology is a little bit different when working with a database management system compared to working with data in R.

  • field: variable or quantity
  • record: collection of fields
  • table: collection of records with all the same fields
  • database: collection of tables

The relationship between R terminology and database terminology is explained below.

R terminology Database terminology
column field
row record
data frame table
types of columns table schema
collection of data frames database

SQL (structured query language) allows you to directly interact with a database and perform tasks such as pull data and make updates. R has two packages available that make it easy to run SQL queries: DBI, RSQLite.

Getting started

To get started, load packages DBI, RSQLite, tidyverse, and Lahman. Install any packages by entering command install.packages("package") in your Console pane.

Package DBI provides a database interface definition for communication between R and relational database management systems. Package RSQLite embeds the ‘SQLite’ database engine in R and provides an interface compliant with the package DBI.

library(DBI)
library(RSQLite)
library(tidyverse)
library(Lahman)

Open connection

Below we create an in-memory database that contains all the data frames from package Lahman. This database will only live during your current R session.

# set up connectiong
con <- dbConnect(SQLite(), ":memory:")

# add data frames in package Lahman to database
for (i in 1:dim(LahmanData)[1]) {
  dbWriteTable(con, LahmanData$file[i], get(LahmanData$file[i]))
}

Object con is the connection you pass into functions dbListTables(), dbListFields(), and dbGetQuery().

To verify that all the data frames were added to the database we can run code

# all tables
con %>%
  dbListTables()
 [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
 [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
 [7] "Batting"             "BattingPost"         "CollegePlaying"     
[10] "Fielding"            "FieldingOF"          "FieldingPost"       
[13] "HallOfFame"          "Managers"            "ManagersHalf"       
[16] "Master"              "Pitching"            "PitchingPost"       
[19] "Salaries"            "Schools"             "SeriesPost"         
[22] "Teams"               "TeamsFranchises"     "TeamsHalf"          

At the conclusion of this RMarkdown file there is a code chunk that closes the connection and frees resources. Run that chunk when you complete the document.

Joins with SQL

Last time you used SELECT and FROM corresponding to one table in a database. However, we may want to extract information from multiple tables. The JOIN option in your SQL query will help streamline this process.

There are 4 main types of join in SQL:

  1. INNER JOIN or just JOIN: retain just the rows for each table that match the condition
  2. LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first table, and just the rows in the second table that match the condition
  3. RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the first table that match the condition, and all rows in the second table
  4. FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables

Fields that cannot be filled in are assigned NA values.

Recall that in package dplyr there are functions left_join(), right_join(), inner_join(), and full_join(), semi_join(), and anti_join().

Example

Suppose we want a table that contains the 10 players with the highest career home run average. The table should also include each player’s average career salary. This information is across two tables in our database.

# fields in batting
con %>%
  dbListFields(name = "Batting")
 [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
 [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
[13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
[19] "HBP"      "SH"       "SF"       "GIDP"    
# fields in salaries
con %>% 
  dbListFields(name = "Salaries")
[1] "yearID"   "teamID"   "lgID"     "playerID" "salary"  

We can extract the information from both tables as a single table via JOIN on the pair: yearID, playerID. These variables form a key. The key uniquely identifies each row of a table.

To get what we want, use

con %>% 
  dbGetQuery(paste("SELECT playerID, AVG(HR), AVG(salary)",
                    "FROM Batting JOIN Salaries USING(yearID, playerID)",
                    "GROUP BY playerID",
                    "ORDER BY Avg(HR) DESC",
                    "LIMIT 10"))

We could have produced the same result with dplyr commands

Batting %>% 
  inner_join(Salaries, by = c("yearID", "playerID")) %>% 
  group_by(playerID) %>% 
  summarise(mean_HR = mean(HR),
            mean_salary = mean(salary)) %>%
  arrange(desc(mean_HR)) %>% 
  slice(1:10)

Join practice

In each task the result is displayed below the question in an HTML table.

Task 1

Use the JOIN option to merge the “Batting” and “Salaries” tables by matching yearID and playerID. Query the year, player, salary, at-bats, and hits for the first 10 records.

Year Player ID Salary At-bats Hits
1985 ackerji01 170000 0 0
1985 agostju01 147500 0 0
1985 aguaylu01 237000 165 46
1985 alexado01 875000 0 0
1985 allenne01 750000 2 0
1985 allenne01 750000 0 0
1985 almonbi01 255000 244 66
1985 anderla02 250500 4 0
1985 andujjo01 1030000 94 10
1985 armasto01 915000 385 102

Task 2

Use the JOIN option to merge the “Batting” and “Salaries” tables by matching yearID and playerID. Find the 10 players with the most strikeouts in a season since the year 2000. Also, display each player’s salary for that season in the resulting table.

Year Player ID Strikeouts Salary
2009 reynoma01 223 422500
2012 dunnad01 222 14000000
2016 davisch02 219 21268890
2013 cartech02 212 494000
2010 reynoma01 211 833333
2015 davisch02 208 12000000
2016 cartech02 206 2500000
2011 stubbdr01 205 450000
2008 reynoma01 204 396500
2007 howarry01 199 900000

Task 3

Find the average career salary for the 5 players who averaged the most strikeouts over their career. Only consider players since 2000 that had an average salary of at least 1 million dollars. Include each player’s first and last name.

First Last Average strikeouts Average salary
Chris Carter 187.7500 1919750
Chris Davis 165.4286 6889661
Ryan Howard 161.6000 15525500
Mike Trout 153.7500 5919083
Adam Dunn 153.6667 9379667

Task 4

Use the JOIN option to merge “Salaries” and “Teams” tables by matching yearID, playerID, and teamID. Find the 10 highest salaries by team in 2016.

Team Total salary
DET 194876481
BOS 188545761
TEX 176038723
BAL 161863456
TOR 138701700
LAA 137251333
SEA 135683339
COL 112645071
PIT 103778833
MIN 102583200

Task 5

Use function dbReadTable() to extract tables “Salaries” and “Teams”. Save these as data frames in R with object names sal.tab2df and teams.tab2df, respectively. Use one of the dplyr joins and other dplyr functions to produce the same result as in Task 4.

Close connection

dbDisconnect(con)