2017-lec26

Source file ⇒ 2017-lec26.Rmd
## Announcements

Tuesday is last class!
Are there any Stat 134 students also taking Stat 133 this semester?

Today

Connections
Handling big data
Databases

1. Connections

The primary function to read files in R is readLines().

readLines() is the workhorse function to read raw text in R as character strings.

A connection is an object that tells R to be prepared for opening a data source (eg file in local directory, or a file url). You can think of a connection as a pointer to the file in memory.

Connections can be used anywhere a file name can be passed to functions like readLines() or writeLines().

Example:

con <- file("example")
open(con, "w")
writeLines("A line", con)
writeLines("Another line", con)
close(con)
readLines("example")

## [1] "A line"       "Another line"

unlink("example") #deletes the file "example"

Connections provide a means to have more control over the way R will “comunicate” with the resources to be read (or written). When you write or read the file, the connection (i.e. pointer) keeps place of where you are in the file.

The above example requires a connection because we are appending to the file example.

One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readLines and writeLines functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:

#Read the first 30 lines, 10 lines at a time
con <- file("~/Desktop/diamonds.json")   #Data from "https://jeroenooms.github.io/data/diamonds.json"
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
data1 %>% head(2)

## [1] "{\"carat\":0.23,\"cut\":\"Ideal\",\"color\":\"E\",\"clarity\":\"SI2\",\"depth\":61.5,\"table\":55,\"price\":326,\"x\":3.95,\"y\":3.98,\"z\":2.43}"  
## [2] "{\"carat\":0.21,\"cut\":\"Premium\",\"color\":\"E\",\"clarity\":\"SI1\",\"depth\":59.8,\"table\":61,\"price\":326,\"x\":3.89,\"y\":3.84,\"z\":2.31}"

But much of the time you don’t need a connection and it is working behind the scenes when you read and write files.

In Class exercise

Do example 1a

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-26-collection/

We use a connection when reading and writing to databases with sqlite and other database programs.

2 Handling Big Data in R

One of the main drawbacks with R in working with big data is that all objects are stored in memory, so you can’t work directly with datasets that are more than 1-20Gb or so, depending on the memory on your machine.

UNIX operations are generally very fast, so if you can manipulate your data via UNIX commands and piping, that will allow you to do a lot. We have seen unix commands for extracting columns and various commands such as grep allows you to pick out rows based on certain criteria. So basic shell scripting may allow you to reduce your data to a more manageable size.

And don’t forget simple things. If you have a dataset with 30 columns that takes up 10 Gb but you only need 5 of the columns, get rid of the rest and work with the smaller dataset. Or you might be able to get the same information from a random sample of your large dataset as you would from doing the analysis on the full dataset. Strategies like this will often allow you to stick with the tools you already know.

Another stategy is to read in your data in pieces using a connection (as illustrated with the JSON example above) or to keep your data in DataBase and extract information from there. This is what we will discuss next.

3 DataBases

A relational database stores data as a set of tables (or relations), which are rather similar to R data frames, in that a table is made up of columns or fields, each containing a single type (numeric, character, date, currency, …) and rows or records containing the observations for one entity. We have discussed DataBases in lecture 3.

You can interact with databases in a variety of database systems (DBMS=database manage- ment system) (some systems are SQLite, MySQL, postgreSQL, Oracle, Access). SQL is the Structured Query Language and is a special-purpose language for managing databases and making queries.

Many DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests. We’ll concentrate here on a simple DBMS, SQLite, that allows us to just work on our local machine, with the database stored as a single file and with MySQL with a database I created on Amazon Web Services (AWS).

You will need to install the packages RSQLite and DBI in the console.

#install.packages("DBI")
#install.packages("RSQLite")
#install.packages("MySQL")
library(DBI)
library(RSQLite)
library(RMySQL)

## 
## Attaching package: 'RMySQL'

## The following object is masked from 'package:RSQLite':
## 
##     isIdCurrent

Basic syntax:

# Once the RMySQL library is installed create a database connection object.
con <- dbConnect(RMySQL::MySQL(), dbname='database_name', host='host', user='user', password='password' )

#You can list tables in the database:
dbListTables(con)

#You can make tables:
dbWriteTable(con, name='table_name', value=data.frame.name)

#You can read tables into R:
dbReadTable(con, name='table_name')

For example:

RSQLite: nycflights13 example:

#install.pack
library(RSQLite)
library(nycflights13)

#The driver, dvr, defines which type of database you’re connecting to (e.g. SQLite, MySQL, etc).

#con <- dbConnect(drv=RSQLite::SQLite(),"~/Desktop/adam_db")  #form a database file on desktop
con <- dbConnect(drv=RSQLite::SQLite(),":memory:")  #create a temporary in memory database

small_flights <- head(flights)
dbWriteTable(con, name="flights", value=small_flights, overwrite=TRUE ) #need overwrite=TRUE if table already exists

## [1] TRUE

dbReadTable(con,name="flights" ) %>% head()

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute  time_hour
## 1     15 1357016400
## 2     29 1357016400
## 3     40 1357016400
## 4     45 1357016400
## 5      0 1357020000
## 6     58 1357016400

dbListTables(con,name="flights")

## [1] "flights"

nycflights <- dbReadTable(con, "flights")
head(nycflights)

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute  time_hour
## 1     15 1357016400
## 2     29 1357016400
## 3     40 1357016400
## 4     45 1357016400
## 5      0 1357020000
## 6     58 1357016400

We do some data wrangling: Group nycflights data by carrier, then summarise() with two variables: n_flights, the number of flights flown by each carrier and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.

nycflights %>% 
  group_by(carrier) %>% 
  summarise(n_flights=n(), avg_delay=mean(arr_delay,na.rm=TRUE)) %>% 
  arrange(avg_delay)

## # A tibble: 4 × 3
##   carrier n_flights avg_delay
##     <chr>     <int>     <dbl>
## 1      DL         1 -25.00000
## 2      B6         1 -18.00000
## 3      UA         3  14.33333
## 4      AA         1  33.00000

In Class exercise

Do example 1b

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-26-collection/

Mysql: example

Mysql allows you to read data in from a remote database. Here are instructions on how to use it: https://www.r-bloggers.com/accessing-mysql-through-r/

I created a free mySQL database in Amazon Web Services Cloud AmazonAWS

If you wish to create your own AWS account there are instructions here under Setting UP RDS ON AWS midway through the article.

library(RMySQL)
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                  dbname = "dplyr", 
                  host = "dplyr.cbzo28gie3st.us-west-2.rds.amazonaws.com", 
                  port = 3306, 
                  user = "dplyr",
                  password = "adamdplyr")
                  
dbWriteTable(conn=con, name='flights', value=as.data.frame(flights))                  
dbListTables(con)
nycflights <- dbReadTable(con, "flights")  #this is quite slow
#head(nycflights)

Queries of databases using SQL

So far we have been reading in tables into R from a database and then wrangling them in R. This is only possible if the data tables aren’t too big. More realistically we will want to wrangle the data in the database using Sructured Query Language (SQL).

The only differnce between using data verbs on a database and using them on a data frame is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database.

small_flights <- head(flights)
small_planes <- head(planes)
con <- dbConnect(drv=RSQLite::SQLite(),":memory:")  #create a temporary in memory database
dbWriteTable(con, name="flights", value=small_flights, overwrite=TRUE ) #need overwrite=TRUE if table already exists

## [1] TRUE

dbWriteTable(con, name="planes", value=small_planes, overwrite=TRUE ) #need overwrite=TRUE if table already exists

## [1] TRUE

db_list_tables(con)

## [1] "flights" "planes"

For queries, SQL has statements like:

SELECT var1, var2, var3 FROM tableX WHERE condition1 AND condition2 ORDER BY var4

Here condition1 might be dep_delay>5 and ORDER BY var4 is equivalent to arrange(dep_time).

flts <- dbSendQuery(con, "SELECT * FROM flights WHERE arr_time<1000")

The results of this query remain on the database. To access the results in R we need to use the fetch function. This saves the results of the query as a data frame object.

#fetch(flts,-1) #n=-1 shows all records
fetch(flts,3)

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
##   minute  time_hour
## 1     15 1357016400
## 2     29 1357016400
## 3     40 1357016400

fetch(flts,3) #this gives the next 3 results.  In this case there are only two more results in the data table.

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      554            600        -6      812            837
## 2 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 2        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute  time_hour
## 1      0 1357020000
## 2     58 1357016400

SQL translation

When doing simple mathematical operations of the form you normally use when filtering, mutating and summarising it’s relatively straightforward to translate R code to SQL

To experiment with the translation, use translate_sql(). The following examples work through some basic differences between R and SQL.

# In SQLite variable names are escaped by double quotes:
translate_sql(x)

## <SQL> "x"

#> <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")

## <SQL> 'x'

#> <SQL> 'x'

# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

#> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

#> <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

#> <SQL> "x" % 2.0 = 10.0

dplyr knows how to convert the following R functions to SQL:

basic math operators: +, -, *, /, %%, ^
math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
logical comparisons: <, <=, !=, >=, >, ==, %in%
boolean operations: &, &&, |, ||, !, xor
basic aggregations: mean, sum, min, max, sd, var