class: center, middle, inverse, title-slide # Unleashing SQL ## 🚀
with R ### Daniel Fryer ### NZSSN ### 2021/11/25 --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/c/c8/Podzia%C5%82_orbitera_na_podstawowe_elementy_konstrukcyjne.svg) background-size: contain ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Sharingan_triple.svg) --- class: center, middle # The fundamentals --- # Why learn R? .pull-left[ Community! - Nerds - Statisticians - Data analysts - Social scientists - Psychologists - Bioinformaticians - Medical researchers - Programmers - Artists - Machine learners - Industry - ... State of the art statistical analysis **packages**. Free, open source, transparent. ] -- .pull-right[ Tons of help and inspiration: - [Twitter](https://twitter.com/rfunctionaday/) - [Blogs](https://mdneuzerling.com/post/my-data-science-job-hunt/) - [Educators](https://emitanaka.org/) - [Books](https://r4ds.had.co.nz/) - [Galleries](https://www.r-graph-gallery.com/) - [Journals](https://journal.r-project.org/) - [Conferences](https://user2021.r-project.org/) - [CRAN](https://cran.r-project.org/) Powerful and capable: - [Reproducible research!](https://ropensci.org/) - Make slides ([xaringan](https://bookdown.org/yihui/rmarkdown/xaringan.html)) - Automatic reports ([RMarkdown](https://bookdown.org/yihui/rmarkdown/)) - Dashboards ([Shiny](https://shiny.rstudio.com/)) - Interactive plots ([plotly](https://plotly.com/r/)) - [Used in industry](https://data-flair.training/blogs/r-applications/) **It's from New Zealand!** ] --- # Why learn R with SQL? Keep your data organised. -- Scaleable: potentially work with more data. -- Understand [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) better. -- Understand the [tidyverse](https://www.tidyverse.org/) better. -- Work with remote servers in large collaborations. -- Combine all your CSV files into a single file ([SQLite](https://www.sqlite.org/index.html)). -- Search and query your combined CSV files with ease ([DB Browser](https://sqlitebrowser.org/)). --- # Showing off ```r library(leaflet) leaflet() %>% addTiles() %>% setView(174.76898, -36.85231, zoom = 10) ```
--- # Project management 🍳 **Scripts** are essentially text files that you save your code to. -- Every **project** should have its own folder (usually with subfolders). -- A single project typically involves: - Preparation, analysis, presentation, data. -- The **working directory** should always be the **project root**. -- Your **workspace** is the RStudio environment, including all the variables you have created, and your 'history'. *Do not rely on this*. -- ''*The source code is real. The objects are realizations of the source code.*'' --- # Where do you put your data? -- .pull-left[ In the data folder! - Raw data - Processed data Always think about a new person opening your project directory for the first time. Will it work seamlessly? Will they know how to use it? *Restart your session frequently.* ] -- -- .pull-right[ ``` umbrella │ README.md │ start-here.R │ └───data │ │ │ └───raw │ └───processed │ └───prepare │ │ cleaning-functions.R │ └───analyse │ │ analysis-functions.R │ └───present │ └───figures └───unleashing-SQL ``` ] --- class: center, middle # Umbrella: a simple example project ## [Click here to open](https://github.com/frycast/umbrella) --- # R is a calculator ```r # a boring calculation 5 + 7 ``` ``` # [1] 12 ``` -- ##### A calculator that can save variables ```r x <- 5 + 7 x*2 ``` ``` ## [1] 24 ``` -- ##### In all sorts of ways ```r x <- c(1,2,3,4,5) x*2 ``` ``` ## [1] 2 4 6 8 10 ``` --- # R has lists These are very powerful ```r Friends <- list( id = c(1,2,3), FirstName = c('X','Y','Z'), LastName = c('A','B','C'), FavColour = c('red', 'blue', NA) ) Friends ``` ``` ## $id ## [1] 1 2 3 ## ## $FirstName ## [1] "X" "Y" "Z" ## ## $LastName ## [1] "A" "B" "C" ## ## $FavColour ## [1] "red" "blue" NA ``` -- *But the output looks kind of weird.* --- # Let's make the lists look better ```r Friends <- data.frame(Friends) Friends ``` ``` ## id FirstName LastName FavColour ## 1 1 X A red ## 2 2 Y B blue ## 3 3 Z C <NA> ``` -- 🤔 *Ummmm... can we do better?* -- ```r library(tibble) Friends <- tibble(Friends) Friends ``` ``` ## # A tibble: 3 x 4 ## id FirstName LastName FavColour ## <dbl> <chr> <chr> <chr> ## 1 1 X A red ## 2 2 Y B blue ## 3 3 Z C <NA> ``` 🧠 *Wow, that's informative!* --- # But *can we do better?* -- ```r library(DT) datatable(Friends) ```
--- # A better dataset
--- # 🦜 Yet underneath, 'tis still but a list 🦜 ```r str(Friends) ``` ``` ## tibble [3 x 4] (S3: tbl_df/tbl/data.frame) ## $ id : num [1:3] 1 2 3 ## $ FirstName: chr [1:3] "X" "Y" "Z" ## $ LastName : chr [1:3] "A" "B" "C" ## $ FavColour: chr [1:3] "red" "blue" NA ``` -- #### Lists, precious lists - 🤖 SQL stores data in a way that is great for machines - 👪 R stores data in a way that is great for programmers -- #### Everything is... Everything in R is either **data** or a **function**. We refer to 'things' as 'objects'. --- The contents of lists can be accessed with `$` ```r Friends$FavColour ``` ``` ## [1] "red" "blue" NA ``` So, what is `$`? -- Well, if `$` is not data, then it's a function. ```r `$`(Friends, FavColour) ``` ``` ## [1] "red" "blue" NA ``` Just a sneaky function. -- Here's another function for accessing things! `[` ```r Friends[1,4] ``` ``` ## # A tibble: 1 x 1 ## FavColour ## <chr> ## 1 red ``` --- # Thinking in vectors Here's a vector of `TRUE` / `FALSE` ```r x <- c(TRUE, FALSE, TRUE) ``` -- We can use it to get friends. ```r Friends[x,] ``` ``` ## # A tibble: 2 x 4 ## id FirstName LastName FavColour ## <dbl> <chr> <chr> <chr> ## 1 1 X A red ## 2 3 Z C <NA> ``` -- ```r Friends[!x,] ``` ``` ## # A tibble: 1 x 4 ## id FirstName LastName FavColour ## <dbl> <chr> <chr> <chr> ## 1 2 Y B blue ``` --- # Packages make functions and data But we can (and should) make our own functions. -- ```r # A function that checks if x is 'blue' is_x_blue <- function(x) { x == 'blue' } ``` -- And we should use them often. ```r is_x_blue('green') ``` ``` ## [1] FALSE ``` ```r is_x_blue('blue') ``` ``` ## [1] TRUE ``` -- We can use the argument explicitly, if we like. ```r is_x_blue(x='red') ``` ``` ## [1] FALSE ``` --- # Functions are building blocks. The longer code gets, the harder it is to think about. -- #### Do you prefer this? ```r spec <- iris$Species iris_setosa <- iris[spec == "setosa", ] m <- lm(Sepal.Length ~ Petal.Length, data=iris_setosa) coef(summary(m)) ``` -- #### Or this? ```r # Get the iris Setosa species iris_setosa <- where_species_is_setosa(iris) # Fit Model (1), and get results results <- fit_linear_model(iris_setosa, number = 1) # Print results results ``` --- # The pipe operator `%>%` ```r # Get the iris Setosa species iris_setosa <- where_species_is_setosa(iris) # Fit Model (1), and get results results <- fit_linear_model(iris_setosa, number = 1) # Print results results ``` -- Many people prefer this instead: ```r # Get Setosa species and fit Model (1) iris %>% where_species_is_setosa() %>% fit_linear_model(number = 1) ``` -- The pipe comes from 📦`magrittr` ```r library(magrittr) ``` --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidyverse.svg) background-size: contain --- class: center, middle # Tidy data --- class: center, middle ''*Happy families are all alike;*</br> *every unhappy family is unhappy in its own way*''</br> \- Leo Tolstoy. -- ''*Tidy datasets are all alike,*</br> *but every messy dataset is messy in its own way*''</br> \- Hadley Wickham. --- class: center, middle ''*The principles of tidy data are closely tied to those of relational databases... but are framed in a language familiar to statisticians*'' [1] .footnote[ [1] [Wickham, H (2014), *Tidy Data*, The R Journal](https://vita.had.co.nz/papers/tidy-data.pdf) ] --- # Tidy data principles 1. Every variable is a column. 1. Every observation is a row. 1. Every cell holds a single (atomic) value. -- Conditions 1 and 2 are often referred to as **long format**. -- ### Advantages - Datasets can be read and understood universally. - Easier for R packages to work together. - Enables SQL-like operations. --- # 🕵️ Messy data The R package 📦[`tidyr`](https://tidyr.tidyverse.org/) has functions for tidying messy data. -- And it also has some datasets for us to play with. ```r library(tidyr) relig_income ```
--- # 🧹 Cleaning `relig_income` This kind of dataset is sometimes referred to as **wide format**. 📦`tidyr` gives us [`pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html) -- ```r pivot_longer( data = relig_income, cols = !religion, names_to = "income", values_to = "count" ) ```
--- # 🔬 Messy data no.2: Anscombe's Quartet <!-- --> --- # Anscombe's Quartet
--- # 🧹 Cleaning Anscombe's Quartet ```r pivot_longer( data = anscombe, cols = everything(), names_to = c(".value", "set"), names_pattern = "(.)(.)" ) ```
--- background-image: url(https://github.com/yihui/xaringan/releases/download/v0.0.2/karl-moustache.jpg) --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/d/d5/P_satellite_dish.svg) background-size: contain --- class: center, middle # Getting connected to SQL --- class: center, middle *For MySQL and T-SQL, see* </br> *[this guide](https://htmlpreview.github.io/?https://github.com/frycast/SQL_course/blob/master/R/connecting-R/databases-in-R.html)*</br> *covering local and remote connections from R* --- # The wonderful [SQLite](https://www.sqlite.org/index.html) SQLite is: - Small - Fast - Self contained - High reliability - Full-featured -- The most used database engine in the world. -- Around one trillion active databases in use. -- Hundreds of SQLite databases on any smart phone. -- One of the top 5 most widely deployed software modules of any kind. --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/3/38/SQLite370.svg) background-size: contain --- # Connect or create, one line! First get our directories organised (see [umbrella](https://github.com/frycast/umbrella)). ```r library(here) ``` -- Now, create or connect ```r library(RSQLite) con <- dbConnect( SQLite(), here("data", "raw", "Sandpit.sqlite") ) ``` *If it can't be found, then* [`Sandpit.sqlite`](https://github.com/frycast/SQL_course/tree/master/R/sqlite-R) *will be created (empty).* --- # Use the Sandpit database The package 📦[`dplyr`](https://dplyr.tidyverse.org/) gives us a whole 'grammar of data manipulation'. The package 📦[`dbplyr`](https://dbplyr.tidyverse.org/) allows `dplyr` to talk to SQL. ```r library(dplyr) library(dbplyr) ``` -- **So many packages!** From now on I'll write them like this: ```r dplyr::tbl() ``` The `dplyr::` means we are using a function from `dplyr`. The function we're using is called `tbl`. --- # 🍌 Connect to a table ```r banana <- dplyr::tbl(con, "Ape_Banana") banana ``` ``` ## # Source: table<Ape_Banana> [?? x 7] ## # Database: sqlite 3.36.0 ## # [D:\CloudDrive\OneDrive\Cloud-drive\Teach\Courses\Workshops\SQL\repositories\umbrella\data\raw\Sandpit.sqlite] ## BananaID TasteRank DatePicked DateEaten Ripe TreeID Comments ## <int> <int> <int> <int> <int> <int> <chr> ## 1 1 2 20181003 20181004 0 1 <NA> ## 2 2 4 20181003 20181004 1 2 <NA> ## 3 3 4 20181003 20181004 1 2 <NA> ## 4 4 5 20181003 20181006 1 1 <NA> ## 5 5 5 20181003 20181006 1 2 best banana ever ## 6 6 3 20181003 20181004 1 2 <NA> ## 7 7 2 20181002 20181004 0 3 <NA> ## 8 8 5 20181002 20181005 1 3 smooth and delectable ## 9 9 3 20181002 20181003 1 4 <NA> ## 10 10 3 20181002 20181003 1 5 <NA> ## # ... with more rows ``` --- # Grammar of data manipulation 💘 SQL The function `dplyr::filter` is like the SQL `WHERE` clause. -- ```r ripe_banana <- banana %>% dplyr::filter(Ripe == 1) ripe_banana ``` ``` ## # Source: lazy query [?? x 7] ## # Database: sqlite 3.36.0 ## # [D:\CloudDrive\OneDrive\Cloud-drive\Teach\Courses\Workshops\SQL\repositories\umbrella\data\raw\Sandpit.sqlite] ## BananaID TasteRank DatePicked DateEaten Ripe TreeID Comments ## <int> <int> <int> <int> <int> <int> <chr> ## 1 2 4 20181003 20181004 1 2 <NA> ## 2 3 4 20181003 20181004 1 2 <NA> ## 3 4 5 20181003 20181006 1 1 <NA> ## 4 5 5 20181003 20181006 1 2 best banana ever ## 5 6 3 20181003 20181004 1 2 <NA> ## 6 8 5 20181002 20181005 1 3 smooth and delectable ## 7 9 3 20181002 20181003 1 4 <NA> ## 8 10 3 20181002 20181003 1 5 <NA> ## 9 12 5 20181002 20181005 1 4 <NA> ## 10 16 5 20181001 20181004 1 5 a culinary delight ## # ... with more rows ``` --- # Grammar of data manipulation 💘 SQL But seriously, `dplyr::filter` is *actually* the `WHERE` clause. ```r ripe_banana %>% dplyr::show_query() ``` ``` ## <SQL> ## SELECT * ## FROM `Ape_Banana` ## WHERE (`Ripe` = 1.0) ``` -- To execute the query: ```r ripe_banana %>% dplyr::collect() ``` ``` ## # A tibble: 34 x 7 ## BananaID TasteRank DatePicked DateEaten Ripe TreeID Comments ## <int> <int> <int> <int> <int> <int> <chr> ## 1 2 4 20181003 20181004 1 2 <NA> ## 2 3 4 20181003 20181004 1 2 <NA> ## 3 4 5 20181003 20181006 1 1 <NA> ## 4 5 5 20181003 20181006 1 2 best banana ever ## 5 6 3 20181003 20181004 1 2 <NA> ## 6 8 5 20181002 20181005 1 3 smooth and delectable ## 7 9 3 20181002 20181003 1 4 <NA> ## 8 10 3 20181002 20181003 1 5 <NA> ## 9 12 5 20181002 20181005 1 4 <NA> ## 10 16 5 20181001 20181004 1 5 a culinary delight ## # ... with 24 more rows ``` --- # ✍️ We can write our own SQL ```r DBI::dbGetQuery(con, " SELECT TreeID, COUNT(*) AS NumRipe, AVG(TasteRank) AS AvgTaste FROM Ape_Banana WHERE DatePicked > '20180101' GROUP BY TreeID; " ) ```
--- # It's up to you ```r tasty_bananas <- banana %>% dplyr::filter(DatePicked > '20180101') %>% dplyr::group_by(TreeID) %>% dplyr::summarise(NumRipe = n(), AvgTaste = mean(TasteRank, na.rm=T)) ``` -- We can always inspect the SQL code created by `dplyr`. ```r tasty_bananas %>% show_query() ``` ``` ## <SQL> ## SELECT `TreeID`, COUNT(*) AS `NumRipe`, AVG(`TasteRank`) AS `AvgTaste` ## FROM `Ape_Banana` ## WHERE (`DatePicked` > '20180101') ## GROUP BY `TreeID` ``` --- # Advanced: programmatically edit queries ```r for (this_taste in c(3,4,5)) { res <- DBI::dbGetQuery(con, stringr::str_interp(" SELECT * FROM Ape_Banana WHERE TasteRank = ${this_taste} ")) cat("\nResults for taste = ", this_taste, "\n") print(nrow(res)) } ``` ``` ## ## Results for taste = 3 ## [1] 7 ## ## Results for taste = 4 ## [1] 10 ## ## Results for taste = 5 ## [1] 18 ``` --- # Advanced no.2: batch process results Send the query without executing it. ```r rs <- DBI::dbSendQuery(con, " SELECT * FROM Ape_Banana ") ``` Process results, 20 at a time ```r while (!DBI::dbHasCompleted(rs)) { twenty_bananas <- DBI::dbFetch(rs, n = 20) # << insert processing on twenty_bananas here >> print(nrow(twenty_bananas)) } ``` ``` ## [1] 20 ## [1] 20 ## [1] 10 ``` --- # Saving results Collect the results ```r tasty_bananas <- dplyr::collect(tasty_bananas) ``` Save as a CSV (📦[`readr`](https://readr.tidyverse.org/) is better at it). ```r library(readr) # Choose the processed data directory location <- here("data", "processed", "tasty_bananas.csv") # Write CSV readr::write_csv(tasty_bananas, location) ``` --- # Saving results Or save to SQLite. ```r # Choose the processed data directory location <- here("data", "processed", "Sandpit_results.sqlite") # Connect or create database res_con <- DBI::dbConnect(RSQLite::SQLite(), location) # Save the table DBI::dbWriteTable(res_con, "tasty_bananas", tasty_bananas) ``` --- # Don't forget to disconnect When you're done. ```r DBI::dbDisconnect(con) DBI::dbDisconnect(res_con) ``` --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/7/74/Space_Exploration_Vehicle.svg) background-size: contain --- class: center, middle # Live guide and demonstrations --- # Live guide and demonstrations - Using GitHub - Downloading [umbrella](https://github.com/frycast/umbrella) repository - Using [DB Browser](https://sqlitebrowser.org/) with SQLite - Live demo of connecting and exploring Also, see the [R code folder](https://github.com/frycast/SQL_course/tree/master/R/sqlite-R) on the course repo. ---