Source file ⇒ lec36.Rmd
The primary functions to read files in R are scan()
and readLines()
.
readLines()
is the workhorse function to read raw text in R as character strings.
scan()
is a low-level function for reading data values, and it is extended by read.table()
and its related functions.
A connection is an object that tells R to be prepared for opening a data source (eg file in local directory, or a file url). You can think of a connection to a pointer to the file in memory.
Connections can be used anywhere a file name could be passed to functions like scan()
or read.table()
or writeLines()
.
Example:
con <- file("example")
open(con, "w")
writeLines("A line", con)
writeLines("Another line", con)
close(con)
readLines("example")
## [1] "A line" "Another line"
unlink("example")
Connections provide a means to have more control over the way R will “comunicate” with the resources to be read (or written) .When you write or read the file, the connection (i.e. pointer) keeps place of where you are in the file.
The above example requires a connection because we are appending to the file ex.data
.
One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readLines
and writeLines
functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:
# Read the first 30 lines, 10 lines at a time
con <- file("http://jeroenooms.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
data1 %>% head(2)
## [1] "{\"carat\":0.23,\"cut\":\"Ideal\",\"color\":\"E\",\"clarity\":\"SI2\",\"depth\":61.5,\"table\":55,\"price\":326,\"x\":3.95,\"y\":3.98,\"z\":2.43}"
## [2] "{\"carat\":0.21,\"cut\":\"Premium\",\"color\":\"E\",\"clarity\":\"SI1\",\"depth\":59.8,\"table\":61,\"price\":326,\"x\":3.89,\"y\":3.84,\"z\":2.31}"
But much of the time you don’t need a connection and it is working behind the scenes when you read and write files.
We use a connection when reading and writing to databases with sqlite
and other database programs.
dplyr
The purpose of databases is to hold large data sets on a server. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle.
For example, a library stores details of all their books in a database. When you want to know if a book is in stock you can enter either the title, author or ISBN number and search for information about the book. You can find out how many copies are stored not only by your local library but also libraries in neighbouring towns. You can check when the book is due to be returned and also reserve it.
The database also records details of all the borrowers, what books they currently have out on loan and when they are due back. When they return their books the librarian will be informed if they are overdue and whether there are any fines outstanding.
MySQL and sqlite are two examples of database programs that create and manage databases. SQLite is file-based. MySQL is server based. SQL is a query language used to interact with those and other database programs. MySQL and sqlite use SQL syntax.
Since R almost exclusively works with in-memory data, if you do have a lot of data in a database, you can’t just dump it into R. Instead, you’ll have to work with subsets of your data.
To experiement with databases, it’s easiest to get started with SQLite because everything you need is included in the R package. You don’t need to install anything else, and you don’t need to deal with the hassle of setting up a database server.
The dplyr
package is the easiest way to work with databases in R. You guys already have experience with a lot of SQL
like commands in dplyr
like select
, mutate
, group_by
, summarize
, arrange
etc. Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql) and google’s bigquery
.
We use the function nycflights13_sqlite()
in dplyr
to form a connection with the database and the function tbl()
in the dplyr
package to read a table in the database.
#install.packages("nycflights13")
library(nycflights13)
tbl(nycflights13_sqlite(),"flights" )
## Caching nycflights db at /var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite
## Creating table: airlines
## Creating table: airports
## Creating table: flights
## Creating table: planes
## Creating table: weather
## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite]
## From: flights [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
tbl(nycflights13_sqlite(), "planes")
## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite]
## From: planes [3,322 x 9]
##
## tailnum year type manufacturer model
## (chr) (int) (chr) (chr) (chr)
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR
## 6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 7 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 8 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 9 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## 10 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214
## .. ... ... ... ... ...
## Variables not shown: engines (int), seats (int), speed (int), engine (chr)
```
You can load and install a baseball database in R
#install.packages("Lahman")
library(Lahman)
##
## Attaching package: 'Lahman'
## The following object is masked from 'package:car':
##
## Salaries
con_batting <- tbl(lahman_sqlite(), "Batting")
## Creating table: AllstarFull
## Creating table: Appearances
## Creating table: AwardsManagers
## Creating table: AwardsPlayers
## Creating table: AwardsShareManagers
## Creating table: AwardsSharePlayers
## Creating table: Batting
## Creating table: BattingPost
## Creating table: CollegePlaying
## Creating table: Fielding
## Creating table: FieldingOF
## Creating table: FieldingPost
## Creating table: HallOfFame
## Creating table: LahmanData
## Creating table: Managers
## Creating table: ManagersHalf
## Creating table: Master
## Creating table: Pitching
## Creating table: PitchingPost
## Creating table: Salaries
## Creating table: Schools
## Creating table: SeriesPost
## Creating table: Teams
## Creating table: TeamsFranchises
## Creating table: TeamsHalf
con_batting
## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/lahman.sqlite]
## From: Batting [99,846 x 22]
##
## playerID yearID stint teamID lgID G AB R H X2B X3B
## (chr) (int) (int) (chr) (chr) (int) (int) (int) (int) (int) (int)
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1
## 7 barkeal01 1871 1 RC1 NA 1 4 0 1 0 0
## 8 barnero01 1871 1 BS1 NA 31 157 66 63 10 9
## 9 barrebi01 1871 1 FW1 NA 1 5 1 1 1 0
## 10 barrofr01 1871 1 BS1 NA 18 86 13 13 2 1
## .. ... ... ... ... ... ... ... ... ... ... ...
## Variables not shown: HR (int), RBI (int), SB (int), CS (int), BB (int), SO
## (int), IBB (int), HBP (int), SH (int), SF (int), GIDP (int)
Batting %>%
group_by(playerID) %>%
summarise(total = sum(G)) %>%
arrange(desc(total)) %>%
head(5)
## Source: local data frame [5 x 2]
##
## playerID total
## (chr) (int)
## 1 rosepe01 3562
## 2 yastrca01 3308
## 3 aaronha01 3298
## 4 henderi01 3081
## 5 cobbty01 3035
It is possible you already have a database on your computer.
Lets load on our computer the nycflights13
database from the nycflights13
package.
We use the funciton src_sqlite()
in dplyr
to form a connection to an existing sqlite database or create a new one.
con <- src_sqlite(path="~/Desktop/my_db.sqlite3", create = TRUE)
Here src_sqlite
forms a connection, con
, of the file on your computer. Since the file my_db.sqlite3
doesn’t exist it will create it. my_db.sqlite3
is called a remote src. Think of a src as a collection of tables (i.e. a database)
You can use src_sqlite()
, src_mysql()
, src_postgres()
and src_bigquery()
to connect to the different databases supported by dplyr.
my_db.sqlite3
currently has no data in it, so we’ll load it up with the flights data using the copy_to()
function.
library(nycflights13)
copy_to(con, airlines,index=list("carrier","name"))
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: airlines [16 x 2]
##
## carrier name
## (chr) (chr)
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
copy_to
copies a table (in this case flights) via the connection con
to the remote src my_db.sqlite3
on my computer).
We use the function tbl()
to read a table in the database (my_db.sqlite3
or nycflights_sqlite()
). Value of tbl
is a table object. A table object only print a few rows and all the columns that fit on one screen, describing the rest of it as text.
airlines_sqlite <- tbl(con, "airlines")
airlines_sqlite
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: airlines [16 x 2]
##
## carrier name
## (chr) (chr)
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
Mysql allows you to read data in from a remote database.
Basic syntax:
# To connect to a database first create a src:
con <- src_mysql(host = "blah.com", user = "adam", password = "pass")
# Then reference a tbl within that src
my_tbl <- tbl(con, "my_table")
Although nycflights13 is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.
con <- src_mysql(dbname = "dplyr",
host = "dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "dplyr",
password = "dplyr")
nycflights <- tbl(con, "dplyr")
nycflights
## Source: mysql 5.6.21-log [dplyr@dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## From: dplyr [336,776 x 17]
##
## id year month day dep_time dep_delay arr_time arr_delay carrier
## (int) (int) (int) (int) (int) (int) (int) (int) (chr)
## 1 1 2013 1 1 517 2 830 11 UA
## 2 2 2013 1 1 533 4 850 20 UA
## 3 3 2013 1 1 542 2 923 33 AA
## 4 4 2013 1 1 544 -1 1004 -18 B6
## 5 5 2013 1 1 554 -6 812 -25 DL
## 6 6 2013 1 1 554 -4 740 12 UA
## 7 7 2013 1 1 555 -5 913 19 B6
## 8 8 2013 1 1 557 -3 709 -14 EV
## 9 9 2013 1 1 557 -3 838 -8 B6
## 10 10 2013 1 1 558 -2 753 8 AA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: tailnum (chr), flight (int), origin (chr), dest
## (chr), air_time (int), distance (int), hour (int), minute (int)
# We do some data wrangling: Group nycflights data by carrier, then summarise() with two variables: n_flights, the number of flights flown by each carrier and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.
nycflights %>% group_by(carrier) %>% summarise(n_flights=n(), avg_delay=mean(arr_delay)) %>% arrange(avg_delay)
## Source: mysql 5.6.21-log [dplyr@dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## From: <derived table> [?? x 3]
## Arrange: avg_delay
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## carrier n_flights avg_delay
## (chr) (dbl) (dbl)
## 1 AS 714 -9.8613
## 2 HA 342 -6.9152
## 3 AA 32729 0.3556
## 4 DL 48110 1.6289
## 5 VX 5162 1.7487
## 6 US 20536 2.0565
## 7 UA 58665 3.5045
## 8 9E 18460 6.9135
## 9 B6 54635 9.3565
## 10 WN 12275 9.4675
## .. ... ... ...
con <- src_sqlite(path="~/Desktop/my_db.sqlite3", create = TRUE)
copy_to(con, flights)
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
flights_sqlite <- tbl(con, "flights")
flights_sqlite
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
We can manipulate the table flights_sqlite
with all the usual dplyr
verbs like select
, filter
, mutate
, summarise
etc.
select(flights_sqlite, year:day, dep_delay, arr_delay)
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 5]
##
## year month day dep_delay arr_delay
## (int) (int) (int) (dbl) (dbl)
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## .. ... ... ... ... ...
mutate(flights_sqlite, speed = air_time / distance)
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 17]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl), speed (dbl)
Find the mean departure time, dep_time
, of flights_sqlite
.
flights_sqlite %>% summarise( delay = mean(dep_time))
## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: <derived table> [?? x 1]
##
## delay
## (dbl)
## 1 1349.11
## .. ...
#or
as.data.frame(flights_sqlite)$dep_time %>% mean(na.rm=TRUE)
## Warning: Only first 100,000 results retrieved. Use n = -1 to retrieve all.
## [1] 1344.535
The only differnce between using these data verbs on a database and using them on a data frame is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database.
When doing simple mathematical operations of the form you normally use when filtering, mutating and summarising it’s relatively straightforward to translate R code to SQL
To experiment with the translation, use translate_sql()
. The following examples work through some basic differences between R and SQL.
# In SQLite variable names are escaped by double quotes:
translate_sql(x)
## <SQL> "x"
#> <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")
## <SQL> 'x'
#> <SQL> 'x'
# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
#> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
#> <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
#> <SQL> "x" % 2.0 = 10.0
dplyr knows how to convert the following R functions to SQL:
basic math operators: +, -, *, /, %%, ^
math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
logical comparisons: <, <=, !=, >=, >, ==, %in%
boolean operations: &, &&, |, ||, !, xor
basic aggregations: mean, sum, min, max, sd, var