A database is a structured set of data. Terminology is a little bit different when working with a database management system compared to working with data in R.
The relationship between R terminology and database terminology is explained below.
R terminology | Database terminology |
---|---|
column | field |
row | record |
data frame | table |
types of columns | table schema |
collection of data frames | database |
SQL (structured query language) allows you to directly interact with a database and perform tasks such as pull data and make updates. R has two packages available that make it easy to run SQL queries: DBI
, RSQLite
.
To get started, load packages DBI
, RSQLite
, tidyverse
, and Lahman
. Install any packages by entering command install.packages("package")
in your Console pane.
Package DBI
provides a database interface definition for communication between R and relational database management systems. Package RSQLite
embeds the ‘SQLite’ database engine in R and provides an interface compliant with the package DBI
.
library(DBI)
library(RSQLite)
library(tidyverse)
library(Lahman)
Below we create an in-memory database that contains all the data frames from package Lahman
. This database will only live during your current R session.
# set up connectiong
con <- dbConnect(SQLite(), ":memory:")
# add data frames in package Lahman to database
for (i in 1:dim(LahmanData)[1]) {
dbWriteTable(con, LahmanData$file[i], get(LahmanData$file[i]))
}
Object con
is the connection you pass into functions dbListTables()
, dbListFields()
, and dbGetQuery()
.
To verify that all the data frames were added to the database we can run code
# all tables
con %>%
dbListTables()
[1] "AllstarFull" "Appearances" "AwardsManagers"
[4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost" "CollegePlaying"
[10] "Fielding" "FieldingOF" "FieldingPost"
[13] "HallOfFame" "Managers" "ManagersHalf"
[16] "Master" "Pitching" "PitchingPost"
[19] "Salaries" "Schools" "SeriesPost"
[22] "Teams" "TeamsFranchises" "TeamsHalf"
At the conclusion of this RMarkdown file there is a code chunk that closes the connection and frees resources. Run that chunk when you complete the document.
Last time you used SELECT
and FROM
corresponding to one table in a database. However, we may want to extract information from multiple tables. The JOIN
option in your SQL query will help streamline this process.
There are 4 main types of join in SQL:
INNER JOIN
or just JOIN
: retain just the rows for each table that match the conditionLEFT OUTER JOIN
or just LEFT JOIN
: retain all rows in the first table, and just the rows in the second table that match the conditionRIGHT OUTER JOIN
or just RIGHT JOIN
: retain just the rows in the first table that match the condition, and all rows in the second tableFULL OUTER JOIN
or just FULL JOIN
: retain all rows in both tablesFields that cannot be filled in are assigned NA values.
Recall that in package dplyr
there are functions left_join()
, right_join()
, inner_join()
, and full_join()
, semi_join()
, and anti_join()
.
Suppose we want a table that contains the 10 players with the highest career home run average. The table should also include each player’s average career salary. This information is across two tables in our database.
# fields in batting
con %>%
dbListFields(name = "Batting")
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "AB" "R" "H" "X2B" "X3B" "HR"
[13] "RBI" "SB" "CS" "BB" "SO" "IBB"
[19] "HBP" "SH" "SF" "GIDP"
# fields in salaries
con %>%
dbListFields(name = "Salaries")
[1] "yearID" "teamID" "lgID" "playerID" "salary"
We can extract the information from both tables as a single table via JOIN on the pair: yearID
, playerID
. These variables form a key. The key uniquely identifies each row of a table.
To get what we want, use
con %>%
dbGetQuery(paste("SELECT playerID, AVG(HR), AVG(salary)",
"FROM Batting JOIN Salaries USING(yearID, playerID)",
"GROUP BY playerID",
"ORDER BY Avg(HR) DESC",
"LIMIT 10"))
We could have produced the same result with dplyr
commands
Batting %>%
inner_join(Salaries, by = c("yearID", "playerID")) %>%
group_by(playerID) %>%
summarise(mean_HR = mean(HR),
mean_salary = mean(salary)) %>%
arrange(desc(mean_HR)) %>%
slice(1:10)
In each task the result is displayed below the question in an HTML table.
Use the JOIN
option to merge the “Batting” and “Salaries” tables by matching yearID
and playerID
. Query the year, player, salary, at-bats, and hits for the first 10 records.
Year | Player ID | Salary | At-bats | Hits |
---|---|---|---|---|
1985 | ackerji01 | 170000 | 0 | 0 |
1985 | agostju01 | 147500 | 0 | 0 |
1985 | aguaylu01 | 237000 | 165 | 46 |
1985 | alexado01 | 875000 | 0 | 0 |
1985 | allenne01 | 750000 | 2 | 0 |
1985 | allenne01 | 750000 | 0 | 0 |
1985 | almonbi01 | 255000 | 244 | 66 |
1985 | anderla02 | 250500 | 4 | 0 |
1985 | andujjo01 | 1030000 | 94 | 10 |
1985 | armasto01 | 915000 | 385 | 102 |
Use the JOIN
option to merge the “Batting” and “Salaries” tables by matching yearID
and playerID
. Find the 10 players with the most strikeouts in a season since the year 2000. Also, display each player’s salary for that season in the resulting table.
Year | Player ID | Strikeouts | Salary |
---|---|---|---|
2009 | reynoma01 | 223 | 422500 |
2012 | dunnad01 | 222 | 14000000 |
2016 | davisch02 | 219 | 21268890 |
2013 | cartech02 | 212 | 494000 |
2010 | reynoma01 | 211 | 833333 |
2015 | davisch02 | 208 | 12000000 |
2016 | cartech02 | 206 | 2500000 |
2011 | stubbdr01 | 205 | 450000 |
2008 | reynoma01 | 204 | 396500 |
2007 | howarry01 | 199 | 900000 |
Find the average career salary for the 5 players who averaged the most strikeouts over their career. Only consider players since 2000 that had an average salary of at least 1 million dollars. Include each player’s first and last name.
First | Last | Average strikeouts | Average salary |
---|---|---|---|
Chris | Carter | 187.7500 | 1919750 |
Chris | Davis | 165.4286 | 6889661 |
Ryan | Howard | 161.6000 | 15525500 |
Mike | Trout | 153.7500 | 5919083 |
Adam | Dunn | 153.6667 | 9379667 |
Use the JOIN
option to merge “Salaries” and “Teams” tables by matching yearID
, playerID
, and teamID
. Find the 10 highest salaries by team in 2016.
Team | Total salary |
---|---|
DET | 194876481 |
BOS | 188545761 |
TEX | 176038723 |
BAL | 161863456 |
TOR | 138701700 |
LAA | 137251333 |
SEA | 135683339 |
COL | 112645071 |
PIT | 103778833 |
MIN | 102583200 |
Use function dbReadTable()
to extract tables “Salaries” and “Teams”. Save these as data frames in R with object names sal.tab2df
and teams.tab2df
, respectively. Use one of the dplyr
joins and other dplyr
functions to produce the same result as in Task 4.
dbDisconnect(con)