## Connections
The primary function for reading files in R is `readLines()`: it is the workhorse for reading raw text into R as character strings.

A connection is an object that tells R to be prepared to open a data source (e.g. a file in a local directory, or a file at a URL). You can think of a connection as a pointer into the file. Connections can be used anywhere a file name can be passed to functions like `readLines()` or `writeLines()`.
Example:
con <- file("example")     # create a connection to the file "example"
open(con, "w")             # open the connection for writing
writeLines("A line", con)
writeLines("Another line", con)
close(con)
readLines("example")
## [1] "A line" "Another line"
unlink("example") #deletes the file "example"
Connections provide a means to have more control over the way R "communicates" with the resource being read (or written). When you write to or read from the file, the connection (i.e. the pointer) keeps track of where you are in the file. The example above requires a connection because we are appending to the file `example`.
One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the `readLines()` and `writeLines()` functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:
# Read the first 30 lines, 10 lines at a time
con <- file("~/Desktop/diamonds.json") # data from https://jeroenooms.github.io/data/diamonds.json
open(con, "r")                         # open the connection for reading
data1 <- readLines(con, n = 10)        # lines 1-10
data2 <- readLines(con, n = 10)        # lines 11-20
data3 <- readLines(con, n = 10)        # lines 21-30
close(con)
library(dplyr) # provides the %>% pipe (and the data-wrangling verbs used later)
data1 %>% head(2)
## [1] "{\"carat\":0.23,\"cut\":\"Ideal\",\"color\":\"E\",\"clarity\":\"SI2\",\"depth\":61.5,\"table\":55,\"price\":326,\"x\":3.95,\"y\":3.98,\"z\":2.43}"
## [2] "{\"carat\":0.21,\"cut\":\"Premium\",\"color\":\"E\",\"clarity\":\"SI1\",\"depth\":59.8,\"table\":61,\"price\":326,\"x\":3.89,\"y\":3.84,\"z\":2.31}"
But much of the time you don't need to create a connection explicitly: when you pass a file name, R creates and manages the connection behind the scenes as it reads or writes the file.
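For instance, a quick sketch reusing the diamonds file from the example above; passing the path directly lets R handle the connection for us:

first_two <- readLines("~/Desktop/diamonds.json", n = 2) # no explicit connection: R opens and closes one automatically
first_two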
Do example 1a: https://scf.berkeley.edu:3838/shiny/alucas/Lecture-26-collection/
We also use a connection when reading from and writing to databases with SQLite and other database programs.
One of the main drawbacks of R for working with big data is that all objects are stored in memory, so you can't work directly with datasets larger than roughly 1-20 GB, depending on the memory of your machine.
UNIX operations are generally very fast, so if you can manipulate your data via UNIX commands and piping, that will take you a long way. We have seen UNIX commands for extracting columns, and commands such as grep let you pick out rows based on certain criteria, so basic shell scripting may allow you to reduce your data to a more manageable size.
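For example, R's pipe() connection lets you hand the heavy lifting to a UNIX pipeline and read only the reduced result. This is a minimal sketch: the file big.csv, the column numbers, and the grep pattern are all hypothetical placeholders.

# Let UNIX extract two columns and filter rows, then read only the result into R
con <- pipe("cut -d, -f1,5 big.csv | grep 2013") # big.csv, the columns, and the pattern are placeholders
subset_data <- read.csv(con, header = FALSE)     # read.csv opens and closes the pipe connection for us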
And don’t forget simple things. If you have a dataset with 30 columns that takes up 10 Gb but you only need 5 of the columns, get rid of the rest and work with the smaller dataset. Or you might be able to get the same information from a random sample of your large dataset as you would from doing the analysis on the full dataset. Strategies like this will often allow you to stick with the tools you already know.
Another strategy is to read in your data in pieces using a connection (as illustrated with the JSON example above and the sketch below) or to keep your data in a database and extract information from there. This is what we will discuss next.
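As a sketch of the "read it in pieces" idea, here is a loop that processes the diamonds file from above in chunks, so the whole file never needs to sit in memory at once (the chunk size and the pattern being counted are arbitrary choices for illustration):

# Count the "Ideal" cut diamonds without ever holding the whole file in memory
con <- file("~/Desktop/diamonds.json", "r")
n_ideal <- 0
repeat {
  chunk <- readLines(con, n = 10000)            # read the next 10,000 lines
  if (length(chunk) == 0) break                 # stop when no lines are left
  n_ideal <- n_ideal + sum(grepl("\"cut\":\"Ideal\"", chunk))
}
close(con)
n_ideal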
A relational database stores data as a set of tables (or relations), which are rather similar to R data frames: a table is made up of columns or fields, each containing a single type (numeric, character, date, currency, ...), and rows or records containing the observations for one entity. We discussed databases in lecture 3.
You can interact with databases through a variety of database management systems (DBMS), such as SQLite, MySQL, PostgreSQL, Oracle, and Access. SQL, the Structured Query Language, is a special-purpose language for managing databases and making queries.
Many DBMS use a client-server model: clients connect to the server, with some authentication, and make requests. We'll concentrate here on SQLite, a simple DBMS that lets us work entirely on our local machine with the database stored as a single file, and on MySQL, using a database I created on Amazon Web Services (AWS).
You will need to install the packages DBI, RSQLite, and RMySQL from the console.
#install.packages("DBI")
#install.packages("RSQLite")
#install.packages("MySQL")
library(DBI)
library(RSQLite)
library(RMySQL)
##
## Attaching package: 'RMySQL'
## The following object is masked from 'package:RSQLite':
##
## isIdCurrent
Basic syntax:
# Once the RMySQL library is installed, create a database connection object:
con <- dbConnect(RMySQL::MySQL(), dbname='database_name', host='host', user='user', password='password')
# List the tables in the database:
dbListTables(con)
# Create a table in the database from a data frame:
dbWriteTable(con, name='table_name', value=your_data_frame)
# Read a table from the database into R:
dbReadTable(con, name='table_name')
For example:
library(RSQLite)
library(nycflights13)
# The driver, drv, defines which type of database you're connecting to (e.g. SQLite, MySQL, etc.)
#con <- dbConnect(drv=RSQLite::SQLite(), "~/Desktop/adam_db") # create a database file on the desktop
con <- dbConnect(drv=RSQLite::SQLite(), ":memory:") # create a temporary in-memory database
small_flights <- head(flights)
dbWriteTable(con, name="flights", value=small_flights, overwrite=TRUE ) #need overwrite=TRUE if table already exists
## [1] TRUE
dbReadTable(con, name="flights") %>% head()
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5
## 5 -25 DL 461 N668DN LGA ATL 116 762 6
## 6 12 UA 1696 N39463 EWR ORD 150 719 5
## minute time_hour
## 1 15 1357016400
## 2 29 1357016400
## 3 40 1357016400
## 4 45 1357016400
## 5 0 1357020000
## 6 58 1357016400
dbListTables(con)
## [1] "flights"
nycflights <- dbReadTable(con, "flights")
head(nycflights)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5
## 5 -25 DL 461 N668DN LGA ATL 116 762 6
## 6 12 UA 1696 N39463 EWR ORD 150 719 5
## minute time_hour
## 1 15 1357016400
## 2 29 1357016400
## 3 40 1357016400
## 4 45 1357016400
## 5 0 1357020000
## 6 58 1357016400
We do some data wrangling: group the nycflights data by carrier, then summarise() with two variables: n_flights, the number of flights flown by each carrier, and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.
nycflights %>%
group_by(carrier) %>%
summarise(n_flights=n(), avg_delay=mean(arr_delay,na.rm=TRUE)) %>%
arrange(avg_delay)
## # A tibble: 4 × 3
## carrier n_flights avg_delay
## <chr> <int> <dbl>
## 1 DL 1 -25.00000
## 2 B6 1 -18.00000
## 3 UA 3 14.33333
## 4 AA 1 33.00000
Do example 1b: https://scf.berkeley.edu:3838/shiny/alucas/Lecture-26-collection/
MySQL allows you to read data in from a remote database. Here are instructions on how to use it: https://www.r-bloggers.com/accessing-mysql-through-r/
I created a free MySQL database in the Amazon Web Services (AWS) cloud. If you wish to create your own AWS account, there are instructions in the r-bloggers article above, under "Setting up RDS on AWS", midway through the article.
library(RMySQL)
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "dplyr",
host = "dplyr.cbzo28gie3st.us-west-2.rds.amazonaws.com",
port = 3306,
user = "dplyr",
password = "adamdplyr")
dbWriteTable(conn=con, name='flights', value=as.data.frame(flights))
dbListTables(con)
nycflights <- dbReadTable(con, "flights") #this is quite slow
#head(nycflights)
So far we have been reading tables from a database into R and then wrangling them in R. This is only possible if the data tables aren't too big. More realistically we will want to wrangle the data inside the database using Structured Query Language (SQL).
The only difference between using the data verbs on a database and using them on a data frame is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database (see the sketch after the setup code below).
small_flights <- head(flights)
small_planes <- head(planes)
con <- dbConnect(drv=RSQLite::SQLite(),":memory:") #create a temporary in memory database
dbWriteTable(con, name="flights", value=small_flights, overwrite=TRUE ) #need overwrite=TRUE if table already exists
## [1] TRUE
dbWriteTable(con, name="planes", value=small_planes, overwrite=TRUE ) #need overwrite=TRUE if table already exists
## [1] TRUE
db_list_tables(con)
## [1] "flights" "planes"
For queries, SQL has statements like:

SELECT var1, var2, var3 FROM tableX WHERE condition1 AND condition2 ORDER BY var4

Here condition1 might be dep_delay > 5, and ORDER BY var4 is the SQL equivalent of dplyr's arrange(var4).
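Filled in with concrete columns from the flights table, the template might look like the query below; dbGetQuery() sends the query and fetches the whole result in one step (the particular columns and conditions are just for illustration):

dbGetQuery(con, "SELECT carrier, flight, dep_delay, dep_time
                 FROM flights
                 WHERE dep_delay > 5 AND origin = 'EWR'
                 ORDER BY dep_time")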
flts <- dbSendQuery(con, "SELECT * FROM flights WHERE arr_time<1000")
The results of this query remain in the database. To access them in R we need to use the fetch() function, which saves the results of the query as a data frame object.
#fetch(flts,-1) #n=-1 shows all records
fetch(flts,3)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## minute time_hour
## 1 15 1357016400
## 2 29 1357016400
## 3 40 1357016400
fetch(flts,3) #this gives the next 3 results. In this case there are only two more results in the data table.
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 554 600 -6 812 837
## 2 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 -25 DL 461 N668DN LGA ATL 116 762 6
## 2 12 UA 1696 N39463 EWR ORD 150 719 5
## minute time_hour
## 1 0 1357020000
## 2 58 1357016400
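Once you are done with a result set, it is good practice to release it and close the connection (standard DBI housekeeping):

dbClearResult(flts) # release the result set held by the database
dbDisconnect(con)   # close the database connection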
When doing simple mathematical operations of the form you normally use when filtering, mutating, and summarising, it's relatively straightforward to translate R code to SQL.
To experiment with the translation, use translate_sql() (from dplyr; in newer versions it lives in the dbplyr package). The following examples work through some basic differences between R and SQL.
# In SQLite variable names are escaped by double quotes:
translate_sql(x)
## <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")
## <SQL> 'x'
# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
dplyr knows how to convert the following R functions to SQL (a quick check with translate_sql() follows the list):

- basic math operators: +, -, *, /, %%, ^
- math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
- logical comparisons: <, <=, !=, >=, >, ==, %in%
- boolean operations: &, &&, |, ||, !, xor
- basic aggregations: mean, sum, min, max, sd, var
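As a quick check of the aggregations, the translations below are a sketch; the exact SQL printed depends on your dplyr/dbplyr version and the database backend:

translate_sql(mean(arr_delay, na.rm = TRUE)) # typically becomes AVG("arr_delay")
translate_sql(max(distance))                 # typically becomes MAX("distance")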