Introduction

A database is a structured set of data. Terminology is a little bit different when working with a database management system compared to working with data in R.

field: variable or quantity
record: collection of fields
table: collection of records with all the same fields
database: collection of tables

The relationship between R terminology and database terminology is explained below.

R terminology	Database terminology
column	field
row	record
data frame	table
types of columns	table schema
collection of data frames	database

SQL (structured query language) allows you to directly interact with a database and perform tasks such as pull data and make updates. R has two packages available that make it easy to run SQL queries: DBI, RSQLite.

Getting started

To get started, load packages DBI, RSQLite, tidyverse, and Lahman. Install any packages by entering command install.packages("package") in your Console pane.

Package DBI provides a database interface definition for communication between R and relational database management systems. Package RSQLite embeds the ‘SQLite’ database engine in R and provides an interface compliant with the package DBI.

library(DBI)
library(RSQLite)
library(tidyverse)
library(Lahman)

Open connection

Below we create an in-memory database that contains all the data frames from package Lahman. This database will only live during your current R session.

# set up connectiong
con <- dbConnect(SQLite(), ":memory:")

# add data frames in package Lahman to database
for (i in 1:dim(LahmanData)[1]) {
  dbWriteTable(con, LahmanData$file[i], get(LahmanData$file[i]))
}

Object con is the connection you pass into functions dbListTables(), dbListFields(), and dbGetQuery().

To verify that all the data frames were added to the database we can run code

# all tables
con %>%
  dbListTables()

 [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
 [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
 [7] "Batting"             "BattingPost"         "CollegePlaying"     
[10] "Fielding"            "FieldingOF"          "FieldingPost"       
[13] "HallOfFame"          "Managers"            "ManagersHalf"       
[16] "Master"              "Pitching"            "PitchingPost"       
[19] "Salaries"            "Schools"             "SeriesPost"         
[22] "Teams"               "TeamsFranchises"     "TeamsHalf"

At the conclusion of this RMarkdown file there is a code chunk that closes the connection and frees resources. Run that chunk when you complete the document.

Joins with SQL

Last time you used SELECT and FROM corresponding to one table in a database. However, we may want to extract information from multiple tables. The JOIN option in your SQL query will help streamline this process.

There are 4 main types of join in SQL:

INNER JOIN or just JOIN: retain just the rows for each table that match the condition
LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first table, and just the rows in the second table that match the condition
RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the first table that match the condition, and all rows in the second table
FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables

Fields that cannot be filled in are assigned NA values.

Recall that in package dplyr there are functions left_join(), right_join(), inner_join(), and full_join(), semi_join(), and anti_join().

Example

Suppose we want a table that contains the 10 players with the highest career home run average. The table should also include each player’s average career salary. This information is across two tables in our database.

# fields in batting
con %>%
  dbListFields(name = "Batting")

 [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
 [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
[13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
[19] "HBP"      "SH"       "SF"       "GIDP"

# fields in salaries
con %>% 
  dbListFields(name = "Salaries")

[1] "yearID"   "teamID"   "lgID"     "playerID" "salary"

We can extract the information from both tables as a single table via JOIN on the pair: yearID, playerID. These variables form a key. The key uniquely identifies each row of a table.

To get what we want, use

con %>% 
  dbGetQuery(paste("SELECT playerID, AVG(HR), AVG(salary)",
                    "FROM Batting JOIN Salaries USING(yearID, playerID)",
                    "GROUP BY playerID",
                    "ORDER BY Avg(HR) DESC",
                    "LIMIT 10"))

We could have produced the same result with dplyr commands

Batting %>% 
  inner_join(Salaries, by = c("yearID", "playerID")) %>% 
  group_by(playerID) %>% 
  summarise(mean_HR = mean(HR),
            mean_salary = mean(salary)) %>%
  arrange(desc(mean_HR)) %>% 
  slice(1:10)

Join practice

In each task the result is displayed below the question in an HTML table.

Task 1

Use the JOIN option to merge the “Batting” and “Salaries” tables by matching yearID and playerID. Query the year, player, salary, at-bats, and hits for the first 10 records.

Year	Player ID	Salary	At-bats	Hits
1985	ackerji01	170000	0	0
1985	agostju01	147500	0	0
1985	aguaylu01	237000	165	46
1985	alexado01	875000	0	0
1985	allenne01	750000	2	0
1985	allenne01	750000	0	0
1985	almonbi01	255000	244	66
1985	anderla02	250500	4	0
1985	andujjo01	1030000	94	10
1985	armasto01	915000	385	102

Task 2

Use the JOIN option to merge the “Batting” and “Salaries” tables by matching yearID and playerID. Find the 10 players with the most strikeouts in a season since the year 2000. Also, display each player’s salary for that season in the resulting table.

Year	Player ID	Strikeouts	Salary
2009	reynoma01	223	422500
2012	dunnad01	222	14000000
2016	davisch02	219	21268890
2013	cartech02	212	494000
2010	reynoma01	211	833333
2015	davisch02	208	12000000
2016	cartech02	206	2500000
2011	stubbdr01	205	450000
2008	reynoma01	204	396500
2007	howarry01	199	900000

Task 3

Find the average career salary for the 5 players who averaged the most strikeouts over their career. Only consider players since 2000 that had an average salary of at least 1 million dollars. Include each player’s first and last name.

First	Last	Average strikeouts	Average salary
Chris	Carter	187.7500	1919750
Chris	Davis	165.4286	6889661
Ryan	Howard	161.6000	15525500
Mike	Trout	153.7500	5919083
Adam	Dunn	153.6667	9379667

Task 4

Use the JOIN option to merge “Salaries” and “Teams” tables by matching yearID, playerID, and teamID. Find the 10 highest salaries by team in 2016.

Team	Total salary
DET	194876481
BOS	188545761
TEX	176038723
BAL	161863456
TOR	138701700
LAA	137251333
SEA	135683339
COL	112645071
PIT	103778833
MIN	102583200

Task 5

Use function dbReadTable() to extract tables “Salaries” and “Teams”. Save these as data frames in R with object names sal.tab2df and teams.tab2df, respectively. Use one of the dplyr joins and other dplyr functions to produce the same result as in Task 4.

Close connection

dbDisconnect(con)

Resources

References

http://www.stat.cmu.edu/~ryantibs/statcomp-F16/lectures/sql_join.html

SQL and R II

Shawn Santo

April 23, 2019