
# Unleashing SQL
## 🚀<br /><br />with R
### Daniel Fryer
### NZSSN
### 2021/11/25

---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/c/c8/Podzia%C5%82_orbitera_na_podstawowe_elementy_konstrukcyjne.svg)

background-size: contain

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Podzia%C5%82_orbitera_na_podstawowe_elementy_konstrukcyjne.svg)

---

# The fundamentals

---

# Why learn R?

.pull-left[

Community!

- Nerds
- Statisticians
- Data analysts
- Social scientists
- Psychologists
- Bioinformaticians
- Medical researchers
- Programmers
- Artists
- Machine learners
- Industry
- ...

State-of-the-art statistical analysis **packages**.

Free, open source, transparent.
]

--
.pull-right[

Tons of help and inspiration:

- [Twitter](https://twitter.com/rfunctionaday/)
- [Blogs](https://mdneuzerling.com/post/my-data-science-job-hunt/)
- [Educators](https://emitanaka.org/)
- [Books](https://r4ds.had.co.nz/)
- [Galleries](https://www.r-graph-gallery.com/)
- [Journals](https://journal.r-project.org/)
- [Conferences](https://user2021.r-project.org/)
- [CRAN](https://cran.r-project.org/)

Powerful and capable:

- [Reproducible research!](https://ropensci.org/)
- Make slides ([xaringan](https://bookdown.org/yihui/rmarkdown/xaringan.html))
- Automatic reports ([RMarkdown](https://bookdown.org/yihui/rmarkdown/))
- Dashboards ([Shiny](https://shiny.rstudio.com/))
- Interactive plots ([plotly](https://plotly.com/r/))
- [Used in industry](https://data-flair.training/blogs/r-applications/)

**It's from New Zealand!**

]

---

# Why learn R with SQL?

Keep your data organised.

Scalable: potentially work with more data.

Understand [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) better.

Understand the [tidyverse](https://www.tidyverse.org/) better.

Work with remote servers in large collaborations.

Combine all your CSV files into a single file ([SQLite](https://www.sqlite.org/index.html)).

Search and query your combined CSV files with ease ([DB Browser](https://sqlitebrowser.org/)).

---

# Showing off

```r
library(leaflet)
leaflet() %>% addTiles() %>% setView(174.76898, -36.85231, zoom = 10)
```

<div id="htmlwidget-851df706b51ab722467c" style="width:100%;height:432px;" class="leaflet html-widget"></div>
<script type="application/json" data-for="htmlwidget-851df706b51ab722467c">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addTiles","args":["//{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"© <a href=\"http://openstreetmap.org\">OpenStreetMap<\/a> contributors, <a href=\"http://creativecommons.org/licenses/by-sa/2.0/\">CC-BY-SA<\/a>"}]}],"setView":[[-36.85231,174.76898],10,[]]},"evals":[],"jsHooks":[]}</script>

---

# Project management 🍳

**Scripts** are essentially text files that you save your code to.

Every **project** should have its own folder (usually with subfolders).

A single project typically involves:

- Preparation, analysis, presentation, data.

The **working directory** should always be the **project root**.

Your **workspace** is the global environment of your R session, including all the variables you have created, and your 'history'. *Do not rely on this*.

"*The source code is real. The objects are realizations of the source code.*"

---

# Where do you put your data?

--
.pull-left[
In the data folder!

- Raw data
- Processed data

Always think about a new person opening your project directory for the first time. Will it work seamlessly? Will they know how to use it?

*Restart your session frequently.*
]

.pull-right[
```
umbrella
│   README.md
│   start-here.R
│
└───data
│   │
│   └───raw
│   └───processed
│   
└───prepare
│   │   cleaning-functions.R
│
└───analyse
│   │   analysis-functions.R
│
└───present
    │
    └───figures
    └───unleashing-SQL
```
]
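
The layout above can be created from R itself. The following is a minimal sketch in base R, assuming a hypothetical project root (a temporary folder stands in for a real project here):

```r
# Hypothetical project root; a temporary folder stands in for a real project
root <- file.path(tempdir(), "umbrella")

# Subfolders taken from the tree above
folders <- c(
  file.path("data", "raw"),
  file.path("data", "processed"),
  "prepare",
  "analyse",
  file.path("present", "figures"),
  file.path("present", "unleashing-SQL")
)

# recursive = TRUE also creates parent folders such as data/ and present/
for (f in folders) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(root, full.names = FALSE, recursive = TRUE)
created
```

In a real project you would replace `root` with your own project folder.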

---

# Umbrella: a simple example project

## [Click here to open](https://github.com/frycast/umbrella)

---

# R is a calculator

```r
# a boring calculation
5 + 7
```

```
## [1] 12
```

##### A calculator that can save variables

```r
x <- 5 + 7
x*2
```

```
## [1] 24
```

##### In all sorts of ways

```r
x <- c(1,2,3,4,5)
x*2
```

```
## [1]  2  4  6  8 10
```

---

# R has lists

These are very powerful

```r
Friends <- list(
  id = c(1,2,3), 
  FirstName = c('X','Y','Z'),
  LastName = c('A','B','C'),
  FavColour = c('red', 'blue', NA)
)
Friends
```

```
## $id
## [1] 1 2 3
## 
## $FirstName
## [1] "X" "Y" "Z"
## 
## $LastName
## [1] "A" "B" "C"
## 
## $FavColour
## [1] "red"  "blue" NA
```

*But the output looks kind of weird.*

---

# Let's make the lists look better

```r
Friends <- data.frame(Friends)
Friends
```

```
##   id FirstName LastName FavColour
## 1  1         X        A       red
## 2  2         Y        B      blue
## 3  3         Z        C      <NA>
```

🤔 *Ummmm... can we do better?*

```r
library(tibble)
# as_tibble() converts an existing data frame into a tibble
Friends <- as_tibble(Friends)
Friends
```

```
## # A tibble: 3 x 4
##      id FirstName LastName FavColour
##   <dbl> <chr>     <chr>    <chr>    
## 1     1 X         A        red      
## 2     2 Y         B        blue     
## 3     3 Z         C        <NA>
```

🧠 *Wow, that's informative!*

---

# But *can we do better?*

```r
library(DT)
datatable(Friends)
```

<div id="htmlwidget-250d805fada7f8c381a4" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-250d805fada7f8c381a4">{"x":{"filter":"none","vertical":false,"data":[["1","2","3"],[1,2,3],["X","Y","Z"],["A","B","C"],["red","blue",null]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>id<\/th>\n      <th>FirstName<\/th>\n      <th>LastName<\/th>\n      <th>FavColour<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":1},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---

# A better dataset

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---:|---:|---:|---:|---|
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |

*First 8 of 50 rows (setosa, versicolor and virginica).*

---

# 🦜 Yet underneath, 'tis still but a list 🦜

```r
str(Friends)
```

```
## tibble [3 x 4] (S3: tbl_df/tbl/data.frame)
##  $ id       : num [1:3] 1 2 3
##  $ FirstName: chr [1:3] "X" "Y" "Z"
##  $ LastName : chr [1:3] "A" "B" "C"
##  $ FavColour: chr [1:3] "red" "blue" NA
```

#### Lists, precious lists

- 🤖 SQL stores data in a way that is great for machines 
- 👪 R stores data in a way that is great for programmers

#### Everything is...

Everything in R is either **data** or a **function**. We refer to 'things' as 'objects'.

---

The contents of lists can be accessed with `$`

```r
Friends$FavColour
```

```
## [1] "red"  "blue" NA
```

So, what is `$`?

Well, if `$` is not data, then it's a function.

```r
`$`(Friends, FavColour)
```

```
## [1] "red"  "blue" NA
```

Just a sneaky function.

Here's another function for accessing things! `[`

```r
Friends[1,4]
```

```
## # A tibble: 1 x 1
##   FavColour
##   <chr>    
## 1 red
```
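
A short sketch of how the accessors differ, using a plain data frame with some of the same columns as `Friends` (rebuilt here, under the lowercase name `friends`, so the example stands alone). It also uses `[[`, a third accessor not shown above:

```r
friends <- data.frame(
  id = c(1, 2, 3),
  FirstName = c("X", "Y", "Z"),
  FavColour = c("red", "blue", NA)
)

# `$` and `[[` both pull out one column as a plain vector
by_dollar  <- friends$FavColour
by_bracket <- friends[["FavColour"]]
identical(by_dollar, by_bracket)

# `[` with a column name keeps the list structure: the result is still a data frame
one_col <- friends["FavColour"]
class(one_col)
```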

---

# Thinking in vectors

Here's a vector of `TRUE` / `FALSE`

```r
x <- c(TRUE, FALSE, TRUE)
```

We can use it to get friends.

```r
Friends[x,]
```

```
## # A tibble: 2 x 4
##      id FirstName LastName FavColour
##   <dbl> <chr>     <chr>    <chr>    
## 1     1 X         A        red      
## 2     3 Z         C        <NA>
```

```r
Friends[!x,]
```

```
## # A tibble: 1 x 4
##      id FirstName LastName FavColour
##   <dbl> <chr>     <chr>    <chr>    
## 1     2 Y         B        blue
```
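
In practice the `TRUE` / `FALSE` vector usually comes from a comparison rather than being typed by hand. Here is a minimal sketch with a base R data frame (rebuilt under the lowercase name `friends` so the example stands alone). Note that comparing `NA` gives `NA`, so `which()` is used to drop it:

```r
friends <- data.frame(
  id = c(1, 2, 3),
  FirstName = c("X", "Y", "Z"),
  FavColour = c("red", "blue", NA)
)

# One logical value per row; the NA colour gives NA, not FALSE
likes_blue <- friends$FavColour == "blue"
likes_blue

# which() keeps only the positions that are TRUE, so the NA row is dropped
blue_friends <- friends[which(likes_blue), ]
blue_friends
```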

---

# Packages provide functions and data

But we can (and should) make our own functions.

```r
# A function that checks if x is 'blue'
is_x_blue <- function(x) {
  x == 'blue'
}
```

And we should use them often.

```r
is_x_blue('green')
```

```
## [1] FALSE
```

```r
is_x_blue('blue')
```

```
## [1] TRUE
```

We can name the argument explicitly, if we like.

```r
is_x_blue(x='red')
```

```
## [1] FALSE
```
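
Named arguments become more useful once a function has more than one. As a sketch, `is_x_colour` below is a hypothetical generalisation of `is_x_blue` (it is not part of the original example) with a default value for its second argument:

```r
# Hypothetical generalisation of is_x_blue: the colour to check is an argument
is_x_colour <- function(x, colour = "blue") {
  x == colour
}

is_x_colour("blue")                      # uses the default colour
is_x_colour("red", colour = "red")       # overrides the default by name
is_x_colour(c("red", "blue", "green"))   # works on a whole vector at once
```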

---

# Functions are building blocks.

The longer code gets, the harder it is to think about.

#### Do you prefer this?

```r
spec <- iris$Species
iris_setosa <- iris[spec == "setosa", ]
m <- lm(Sepal.Length ~ Petal.Length, data=iris_setosa)
coef(summary(m))
```

#### Or this?

```r
# Get the iris Setosa species
iris_setosa <- where_species_is_setosa(iris)

# Fit Model (1), and get results
results <- fit_linear_model(iris_setosa, number = 1)

# Print results
results
```
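
The two helper functions in the second version are not defined on this slide. One possible way to write them is sketched below. The function bodies are assumptions, built from the four-line version above:

```r
# Hypothetical helper: keep only the rows where Species is "setosa"
where_species_is_setosa <- function(data) {
  data[data$Species == "setosa", ]
}

# Hypothetical helper: fit one of a numbered list of models
# and return its coefficient table
fit_linear_model <- function(data, number = 1) {
  formulas <- list(
    Sepal.Length ~ Petal.Length   # Model (1)
  )
  m <- lm(formulas[[number]], data = data)
  coef(summary(m))
}

iris_setosa <- where_species_is_setosa(iris)
results <- fit_linear_model(iris_setosa, number = 1)
results
```

The details live inside the functions, so the calling code reads like a description of the analysis.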

---

# The pipe operator `%>%`

```r
# Get the iris Setosa species
iris_setosa <- where_species_is_setosa(iris)

# Fit Model (1), and get results
results <- fit_linear_model(iris_setosa, number = 1)

# Print results
results
```

Many people prefer this instead:

```r
# Get Setosa species and fit Model (1)
iris %>% 
  where_species_is_setosa() %>%
  fit_linear_model(number = 1)
```

The pipe comes from 📦`magrittr`

```r
library(magrittr)
```
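
The pipe passes the value on its left as the first argument of the function on its right, so `x %>% f(y)` means `f(x, y)`. A small sketch with made-up numbers:

```r
library(magrittr)

x <- c(1, 4, 9)

# Without the pipe: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the pipe: read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```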

---

background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidyverse.svg)

background-size: contain

---

# Tidy data

---

"*Happy families are all alike;*<br />
*every unhappy family is unhappy in its own way*"<br />
\- Leo Tolstoy.

"*Tidy datasets are all alike,*<br />
*but every messy dataset is messy in its own way*"<br />
\- Hadley Wickham.

---

"*The principles of tidy data are closely tied to those of relational databases... but are framed in a language familiar to statisticians*" [1]

.footnote[
[1] [Wickham, H. (2014), *Tidy Data*, Journal of Statistical Software, 59(10)](https://vita.had.co.nz/papers/tidy-data.pdf)
]
---

# Tidy data principles

1. Every variable is a column.

2. Every observation is a row.

3. Every cell holds a single (atomic) value.

Conditions 1 and 2 are often referred to as **long format**.

### Advantages

- Datasets can be read and understood universally.

- Easier for R packages to work together.

- Enables SQL-like operations.

---

# 🕵️ Messy data

The R package 📦[`tidyr`](https://tidyr.tidyverse.org/) has functions for tidying messy data.

And it also has some datasets for us to play with.

```r
library(tidyr)
relig_income
```

<div id="htmlwidget-22ff522e3d67f33a812c" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-22ff522e3d67f33a812c">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18"],["Agnostic","Atheist","Buddhist","Catholic","Don’t know/refused","Evangelical Prot","Hindu","Historically Black Prot","Jehovah's Witness","Jewish","Mainline Prot","Mormon","Muslim","Orthodox","Other Christian","Other Faiths","Other World Religions","Unaffiliated"],[27,12,27,418,15,575,1,228,20,19,289,29,6,13,9,20,5,217],[34,27,21,617,14,869,9,244,27,19,495,40,7,17,7,33,2,299],[60,37,30,732,15,1064,7,236,24,25,619,48,9,23,11,40,3,374],[81,52,34,670,11,982,9,238,24,25,655,51,10,32,13,46,4,365],[76,35,33,638,10,881,11,197,21,30,651,56,9,32,13,49,2,341]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>religion<\/th>\n      <th><$10k<\/th>\n      <th>$10-20k<\/th>\n      <th>$20-30k<\/th>\n      <th>$30-40k<\/th>\n      <th>$40-50k<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":3,"columnDefs":[{"className":"dt-right","targets":[2,3,4,5,6]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[3,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

# 🧹 Cleaning `relig_income`

This kind of dataset is sometimes referred to as **wide format**.

📦`tidyr` gives us [`pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html)

```r
pivot_longer(
  data = relig_income,
  cols = !religion,
  names_to = "income", 
  values_to = "count"
)
```

| | religion | income | count |
|---|---|---|---:|
| 1 | Agnostic | `<$10k` | 27 |
| 2 | Agnostic | `$10-20k` | 34 |
| 3 | Agnostic | `$20-30k` | 60 |
| 4 | Agnostic | `$30-40k` | 81 |
| 5 | Agnostic | `$40-50k` | 76 |
| 6 | Agnostic | `$50-75k` | 137 |
| 7 | Agnostic | `$75-100k` | 122 |
| 8 | Agnostic | `$100-150k` | 109 |
| 9 | Agnostic | `>150k` | 84 |
| 10 | Agnostic | `Don't know/refused` | 96 |

*First 10 of 180 rows.*
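
Going from wide to long loses nothing. As a sketch, `pivot_wider()` (also from 📦`tidyr`) reverses the step; the names `long` and `wide` are only for illustration:

```r
library(tidyr)

long <- pivot_longer(
  data = relig_income,
  cols = !religion,
  names_to = "income",
  values_to = "count"
)

# pivot_wider() reverses the operation: one column per income bracket again
wide <- pivot_wider(
  data = long,
  names_from = income,
  values_from = count
)

dim(long)
dim(wide)
```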

---

# 🔬 Messy data no.2: Anscombe's Quartet

![](unleashing-SQL-with-R_files/figure-html/unnamed-chunk-30-1.svg)

---

# Anscombe's Quartet

<div id="htmlwidget-f8ff342d01ff98433cda" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-f8ff342d01ff98433cda">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11"],[10,8,13,9,11,14,6,4,12,7,5],[10,8,13,9,11,14,6,4,12,7,5],[10,8,13,9,11,14,6,4,12,7,5],[8,8,8,8,8,8,8,19,8,8,8],[8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68],[9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74],[7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73],[6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>x1<\/th>\n      <th>x2<\/th>\n      <th>x3<\/th>\n      <th>x4<\/th>\n      <th>y1<\/th>\n      <th>y2<\/th>\n      <th>y3<\/th>\n      <th>y4<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":6,"columnDefs":[{"className":"dt-right","targets":[1,2,3,4,5,6,7,8]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[6,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

# 🧹 Cleaning Anscombe's Quartet

```r
pivot_longer(
  data = anscombe,
  cols = everything(),
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"
)
```
<div id="htmlwidget-ded9d8e4d8ff35b284ce" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ded9d8e4d8ff35b284ce">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44"],["1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4"],[10,10,10,8,8,8,8,8,13,13,13,8,9,9,9,8,11,11,11,8,14,14,14,8,6,6,6,8,4,4,4,19,12,12,12,8,7,7,7,8,5,5,5,8],[8.04,9.14,7.46,6.58,6.95,8.14,6.77,5.76,7.58,8.74,12.74,7.71,8.81,8.77,7.11,8.84,8.33,9.26,7.81,8.47,9.96,8.1,8.84,7.04,7.24,6.13,6.08,5.25,4.26,3.1,5.39,12.5,10.84,9.13,8.15,5.56,4.82,7.26,6.42,7.91,5.68,4.74,5.73,6.89]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>set<\/th>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":4,"columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[4,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>
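
In the call above, `names_pattern = "(.)(.)"` splits each two-character column name, such as `x1`, into its first character (sent to `.value`, so it becomes the column `x` or `y`) and its second character (sent to `set`). Once the data is tidy, SQL-like grouped summaries are short. A sketch, assuming 📦`dplyr` is installed; the names `tidy_anscombe` and `set_summary` are only for illustration:

```r
library(tidyr)
library(dplyr)

tidy_anscombe <- pivot_longer(
  data = anscombe,
  cols = everything(),
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"
)

# One row per set, much like GROUP BY in SQL
set_summary <- tidy_anscombe %>%
  group_by(set) %>%
  summarise(mean_x = mean(x), mean_y = mean(y), n = n())

set_summary
```

The four sets have the same mean of `x`, which is part of what makes the quartet famous.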

---
background-image: url(https://github.com/yihui/xaringan/releases/download/v0.0.2/karl-moustache.jpg)
---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/d/d5/P_satellite_dish.svg)

background-size: contain

---

# Getting connected to SQL

---

*For MySQL and T-SQL, see* <br />
*[this guide](https://htmlpreview.github.io/?https://github.com/frycast/SQL_course/blob/master/R/connecting-R/databases-in-R.html)*<br />
*covering local and remote connections from R*

---

# The wonderful [SQLite](https://www.sqlite.org/index.html)

SQLite is:
- Small
- Fast
- Self-contained
- Highly reliable
- Full-featured

The most used database engine in the world.

Around one trillion active databases in use.

Hundreds of SQLite databases on any smart phone.

One of the top 5 most widely deployed software modules of any kind.

---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/3/38/SQLite370.svg)

background-size: contain

---

# Connect or create, one line!

First get our directories organised (see [umbrella](https://github.com/frycast/umbrella)).

```r
library(here)
```

Now, create or connect

```r
library(RSQLite)
con <- dbConnect(
  SQLite(), 
  here("data", "raw", "Sandpit.sqlite")
)
```

*If it can't be found, then* [`Sandpit.sqlite`](https://github.com/frycast/SQL_course/tree/master/R/sqlite-R) *will be created (empty).*
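
After connecting, it helps to check what the database contains. The sketch below uses a temporary in-memory database and a made-up `Friends` table, so it does not touch `Sandpit.sqlite`; the name `con_demo` is only for illustration:

```r
library(DBI)

# ":memory:" asks SQLite for a temporary database that lives only in RAM
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, written only so there is something to list
dbWriteTable(
  con_demo, "Friends",
  data.frame(id = 1:3, FirstName = c("X", "Y", "Z"))
)

tables <- dbListTables(con_demo)             # names of all tables
fields <- dbListFields(con_demo, "Friends")  # column names of one table

tables
fields

dbDisconnect(con_demo)
```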

---

# Use the Sandpit database

The package 📦[`dplyr`](https://dplyr.tidyverse.org/) gives us a whole 'grammar of data manipulation'.

The package 📦[`dbplyr`](https://dbplyr.tidyverse.org/) allows `dplyr` to talk to SQL.

```r
library(dplyr)
library(dbplyr)
```

**So many packages!** From now on I'll write function calls like this:

```r
dplyr::tbl()
```

The `dplyr::` means we are using a function from `dplyr`. The function we're using is called `tbl`.

---

# 🍌 Connect to a table

```r
banana <- dplyr::tbl(con, "Ape_Banana")
banana
```

```
## # Source:   table<Ape_Banana> [?? x 7]
## # Database: sqlite 3.36.0
## #   [D:\CloudDrive\OneDrive\Cloud-drive\Teach\Courses\Workshops\SQL\repositories\umbrella\data\raw\Sandpit.sqlite]
##    BananaID TasteRank DatePicked DateEaten  Ripe TreeID Comments             
##       <int>     <int>      <int>     <int> <int>  <int> <chr>                
##  1        1         2   20181003  20181004     0      1 <NA>                 
##  2        2         4   20181003  20181004     1      2 <NA>                 
##  3        3         4   20181003  20181004     1      2 <NA>                 
##  4        4         5   20181003  20181006     1      1 <NA>                 
##  5        5         5   20181003  20181006     1      2 best banana ever     
##  6        6         3   20181003  20181004     1      2 <NA>                 
##  7        7         2   20181002  20181004     0      3 <NA>                 
##  8        8         5   20181002  20181005     1      3 smooth and delectable
##  9        9         3   20181002  20181003     1      4 <NA>                 
## 10       10         3   20181002  20181003     1      5 <NA>                 
## # ... with more rows
```

---

# Grammar of data manipulation 💘 SQL

The function `dplyr::filter` is like the SQL `WHERE` clause.

```r
ripe_banana <- banana %>% dplyr::filter(Ripe == 1)
ripe_banana
```

```
## # Source:   lazy query [?? x 7]
## # Database: sqlite 3.36.0
## #   [D:\CloudDrive\OneDrive\Cloud-drive\Teach\Courses\Workshops\SQL\repositories\umbrella\data\raw\Sandpit.sqlite]
##    BananaID TasteRank DatePicked DateEaten  Ripe TreeID Comments             
##       <int>     <int>      <int>     <int> <int>  <int> <chr>                
##  1        2         4   20181003  20181004     1      2 <NA>                 
##  2        3         4   20181003  20181004     1      2 <NA>                 
##  3        4         5   20181003  20181006     1      1 <NA>                 
##  4        5         5   20181003  20181006     1      2 best banana ever     
##  5        6         3   20181003  20181004     1      2 <NA>                 
##  6        8         5   20181002  20181005     1      3 smooth and delectable
##  7        9         3   20181002  20181003     1      4 <NA>                 
##  8       10         3   20181002  20181003     1      5 <NA>                 
##  9       12         5   20181002  20181005     1      4 <NA>                 
## 10       16         5   20181001  20181004     1      5 a culinary delight   
## # ... with more rows
```

---

# Grammar of data manipulation 💘 SQL

But seriously, `dplyr::filter` is *actually* the `WHERE` clause.

```r
ripe_banana %>% dplyr::show_query()
```

```
## <SQL>
## SELECT *
## FROM `Ape_Banana`
## WHERE (`Ripe` = 1.0)
```

To execute the query and collect the results into R:

```r
ripe_banana %>% dplyr::collect()
```

```
## # A tibble: 34 x 7
##    BananaID TasteRank DatePicked DateEaten  Ripe TreeID Comments             
##       <int>     <int>      <int>     <int> <int>  <int> <chr>                
##  1        2         4   20181003  20181004     1      2 <NA>                 
##  2        3         4   20181003  20181004     1      2 <NA>                 
##  3        4         5   20181003  20181006     1      1 <NA>                 
##  4        5         5   20181003  20181006     1      2 best banana ever     
##  5        6         3   20181003  20181004     1      2 <NA>                 
##  6        8         5   20181002  20181005     1      3 smooth and delectable
##  7        9         3   20181002  20181003     1      4 <NA>                 
##  8       10         3   20181002  20181003     1      5 <NA>                 
##  9       12         5   20181002  20181005     1      4 <NA>                 
## 10       16         5   20181001  20181004     1      5 a culinary delight   
## # ... with 24 more rows
```
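
Other `dplyr` verbs map to SQL clauses in the same way. The sketch below uses an in-memory database with a small made-up stand-in for `Ape_Banana` (three columns, four rows), so the values are not from the real Sandpit data:

```r
library(dplyr)
library(dbplyr)

con_demo <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical stand-in for Ape_Banana with just three columns
DBI::dbWriteTable(con_demo, "Ape_Banana", data.frame(
  BananaID  = 1:4,
  TasteRank = c(2L, 4L, 5L, 3L),
  Ripe      = c(0L, 1L, 1L, 1L)
))

query <- dplyr::tbl(con_demo, "Ape_Banana") %>%
  dplyr::filter(Ripe == 1) %>%             # WHERE
  dplyr::select(BananaID, TasteRank) %>%   # SELECT
  dplyr::arrange(desc(TasteRank))          # ORDER BY

dplyr::show_query(query)        # the SQL that dbplyr generated
ripe <- dplyr::collect(query)   # run it and bring the rows into R

DBI::dbDisconnect(con_demo)
ripe
```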

---

# ✍️ We can write our own SQL

```r
DBI::dbGetQuery(con, 
"
SELECT TreeID, COUNT(*) AS NumRipe, AVG(TasteRank) AS AvgTaste 
FROM Ape_Banana
WHERE DatePicked > '20180101'
GROUP BY TreeID;
"
)
```

<div id="htmlwidget-0aa4c27d37ab40643939" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-0aa4c27d37ab40643939">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18"],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18],[2,7,3,3,4,1,4,2,2,1,2,4,2,2,3,3,2,3],[3.5,3.71428571428571,3.33333333333333,3,3.25,5,2.75,3.5,3,5,3.5,3,4,4.5,3.33333333333333,2.33333333333333,4.5,4]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>TreeID<\/th>\n      <th>NumRipe<\/th>\n      <th>AvgTaste<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":4,"columnDefs":[{"className":"dt-right","targets":[1,2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[4,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

# It's up to you

```r
tasty_bananas <- banana %>% 
  dplyr::filter(DatePicked > '20180101') %>% 
  dplyr::group_by(TreeID) %>% 
  dplyr::summarise(NumRipe = n(), AvgTaste = mean(TasteRank, na.rm = TRUE))
```

We can always inspect the SQL code created by `dplyr`.

```r
tasty_bananas %>% dplyr::show_query()
```

```
## <SQL>
## SELECT `TreeID`, COUNT(*) AS `NumRipe`, AVG(`TasteRank`) AS `AvgTaste`
## FROM `Ape_Banana`
## WHERE (`DatePicked` > '20180101')
## GROUP BY `TreeID`
```

---

# Advanced: programmatically edit queries

```r
for (this_taste in c(3,4,5)) {
  res <- DBI::dbGetQuery(con, stringr::str_interp("
    SELECT * 
    FROM Ape_Banana
    WHERE TasteRank = ${this_taste}                  
  "))
  cat("\nResults for taste = ", this_taste, "\n")
  print(nrow(res))
}
```

```
## 
## Results for taste =  3 
## [1] 7
## 
## Results for taste =  4 
## [1] 10
## 
## Results for taste =  5 
## [1] 18
```
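
Pasting values into the SQL text works for trusted numbers, but it can break or be abused when the values come from users. An alternative is a parameterised query, where `DBI` passes the value separately through `params`. The sketch below uses an in-memory database and made-up data, so the counts differ from those above:

```r
con_demo <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical stand-in for Ape_Banana
DBI::dbWriteTable(con_demo, "Ape_Banana", data.frame(
  BananaID  = 1:6,
  TasteRank = c(3L, 4L, 5L, 5L, 3L, 5L)
))

counts <- c()
for (this_taste in c(3, 4, 5)) {
  # The ? placeholder is filled in by the database driver, not by pasting text
  res <- DBI::dbGetQuery(
    con_demo,
    "SELECT * FROM Ape_Banana WHERE TasteRank = ?",
    params = list(this_taste)
  )
  counts <- c(counts, nrow(res))
}

DBI::dbDisconnect(con_demo)
counts
```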

---

# Advanced no.2: batch process results

Send the query, but don't fetch any results yet.

```r
rs <- DBI::dbSendQuery(con, "
  SELECT *
  FROM Ape_Banana
")
```

Process results, 20 at a time

```r
while (!DBI::dbHasCompleted(rs)) {
  twenty_bananas <- DBI::dbFetch(rs, n = 20)
  
  # << insert processing on twenty_bananas here >>
  
  print(nrow(twenty_bananas))
}

# Release the result set once everything has been fetched
DBI::dbClearResult(rs)
```

```
## [1] 20
## [1] 20
## [1] 10
```

---

# Saving results

Collect the results

```r
tasty_bananas <- dplyr::collect(tasty_bananas)
```

Save as a CSV (📦[`readr`](https://readr.tidyverse.org/) does this better than base R).

```r
library(readr)

# Choose the processed data directory
location <- here("data", "processed", "tasty_bananas.csv")

# Write CSV
readr::write_csv(tasty_bananas, location)
```

---

# Saving results

Or save to SQLite.

```r
# Choose the processed data directory
location <- here("data", "processed", "Sandpit_results.sqlite")

# Connect or create database
res_con <- DBI::dbConnect(RSQLite::SQLite(), location)

# Save the table
DBI::dbWriteTable(res_con, "tasty_bananas", tasty_bananas)
```
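
By default `DBI::dbWriteTable` raises an error if the table already exists, which matters when a script is re-run. The sketch below uses `overwrite = TRUE` and then reads the table back to check it. It works on a temporary file with made-up results, so all the `_demo` names are only for illustration:

```r
# A temporary file stands in for data/processed/Sandpit_results.sqlite
location_demo <- tempfile(fileext = ".sqlite")
res_con_demo <- DBI::dbConnect(RSQLite::SQLite(), location_demo)

# Hypothetical results table
results_demo <- data.frame(TreeID = 1:3, AvgTaste = c(3.5, 4, 2.75))

# overwrite = TRUE replaces an existing table instead of raising an error,
# so writing twice (as happens when a script is re-run) is safe
DBI::dbWriteTable(res_con_demo, "tasty_bananas", results_demo, overwrite = TRUE)
DBI::dbWriteTable(res_con_demo, "tasty_bananas", results_demo, overwrite = TRUE)

# Read the whole table back into a data frame
read_back <- DBI::dbReadTable(res_con_demo, "tasty_bananas")

DBI::dbDisconnect(res_con_demo)
read_back
```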

---

# Don't forget to disconnect

When you're done.

```r
DBI::dbDisconnect(con)
DBI::dbDisconnect(res_con)
```
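
Inside a function, `on.exit()` can make sure the disconnect happens even if a query fails. A sketch follows; `count_rows` is a hypothetical helper and the temporary database is made up for the example:

```r
# Hypothetical helper: open a connection, count rows, always disconnect
count_rows <- function(path, table) {
  con_tmp <- DBI::dbConnect(RSQLite::SQLite(), path)
  # on.exit() runs when the function finishes, even if the query fails
  on.exit(DBI::dbDisconnect(con_tmp), add = TRUE)

  # Quote the table name safely before pasting it into the SQL
  sql <- paste0(
    "SELECT COUNT(*) AS n FROM ",
    DBI::dbQuoteIdentifier(con_tmp, table)
  )
  DBI::dbGetQuery(con_tmp, sql)$n
}

# Set up a small temporary database to try it on
path_demo <- tempfile(fileext = ".sqlite")
setup <- DBI::dbConnect(RSQLite::SQLite(), path_demo)
DBI::dbWriteTable(setup, "Ape_Banana", data.frame(BananaID = 1:5))
DBI::dbDisconnect(setup)

n_bananas <- count_rows(path_demo, "Ape_Banana")
n_bananas
```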

---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/7/74/Space_Exploration_Vehicle.svg)

background-size: contain

---

# Live guide and demonstrations

---

# Live guide and demonstrations

- Using GitHub
- Downloading [umbrella](https://github.com/frycast/umbrella) repository
- Using [DB Browser](https://sqlitebrowser.org/) with SQLite
- Live demo of connecting and exploring

Also, see the [R code folder](https://github.com/frycast/SQL_course/tree/master/R/sqlite-R) on the course repo.

---