\(R^4H_2O\)

Day 1

08 July 2024

Introductions

Learning Objectives Level 1

  • Apply the principles of strategic data science to solve water problems
  • Write R code to load, analyse, and visualise data
  • Diagnose water quality data with descriptive statistics
  • Develop presentations, reports, and applications to share results

Course Structure

  • Case study approach using realistic synthetic data
  • Hands-on sessions
  • Based on generic principles

Program Level 1

  1. Principles of Data Science
  2. Basics of the R Language
  3. Case Study: Channel flows
  4. Case Study: Water quality
    • Reading CSV and Spreadsheet data
    • Descriptive Statistics
  5. Visualise data with ggplot2
  6. Data science workflow
  7. Creating data products

Data Science Learning Curve

The steeper the learning curve, the bigger the reward.

What is Data Science?

Conway Venn Diagram (drewconway.com).

Many data scientists have published modifications of this model. Can you think of some other competencies essential for data scientists?

What is Useful Data Science?

What is Good Data Science?

Spreadsheets are Chaos

Source: lucidmanager.org

Code is Poetry: R

library(readr)
library(dplyr)
labdata <- read_csv("../data/water_quality.csv")
group_by(labdata, Measure) %>% 
    summarise(min = min(Result),
              mean = mean(Result),
              sd = sd(Result))
## # A tibble: 4 × 4
##   Measure           min    mean     sd
##   <chr>           <dbl>   <dbl>  <dbl>
## 1 Chlorine Total 0.025  0.499   0.499 
## 2 E. coli        0      0.00789 0.136 
## 3 THM            0.0005 0.0366  0.0978
## 4 Turbidity      0.05   0.360   1.03

Code is Poetry: Python

import pandas as pd
import numpy as np
labdata = pd.read_csv("../data/water_quality.csv")
 
labdata.groupby("Measure")["Result"].agg([np.min, np.mean, np.std])
##                    min      mean       std
## Measure                                   
## Chlorine Total  0.0250  0.498967  0.498730
## E. coli         0.0000  0.007895  0.135584
## THM             0.0005  0.036565  0.097821
## Turbidity       0.0500  0.360327  1.025126

Code is Poetry: Julia

using DataFrames, CSV, StatsBase
labdata = DataFrame(CSV.File("../data/water_quality.csv"));

combine(groupby(labdata, :Measure),
        :Result => minimum,
        :Result => mean,
        :Result => std)
## 4×4 DataFrame
##  Row │ Measure         Result_minimum  Result_mean  Result_std
##      │ String15        Float64         Float64      Float64
## ─────┼─────────────────────────────────────────────────────────
##    1 │ Chlorine Total          0.025    0.498967     0.49873
##    2 │ E. coli                 0.0      0.00789474   0.135584
##    3 │ Turbidity               0.05     0.360327     1.02513
##    4 │ THM                     0.0005   0.0365647    0.0978211

Basics of the R Language

R Studio screenshot.

  1. Register at posit.cloud
  2. New Project From Github Repo
  3. Enter: https://github.com/pprevos/r4h2o/
  4. RStudio will download the project files

Arithmetic

  • Console (REPL: Read-Eval-Print Loop)
  • Enter lines below
diameter <- 150
pipe_area <- (pi / 4) * (diameter / 1000)^2
pipe_area
## [1] 0.01767146
sqrt(pipe_area / (pi / 4)) * 1000
## [1] 150

Internet meme.

6 / 2 * (1 + 2)
## [1] 9

REPL and Scripting

Vectors

complaints <- c(12, 13, 23, 45, 22, 99, 31)

sum(complaints)
prod(complaints)
abs(complaints)
exp(complaints)
factorial(complaints)
log(complaints, base = 3)
log10(complaints)

Basic Plotting 1

diameter <- 50:350
pipe_area <- (pi / 4) * (diameter / 1000)^2
par(mar = c(4, 4, 1, 1))
plot(diameter, pipe_area, type = "l", col = "blue")
abline(v = 150, col = "grey", lty = 2)
abline(h = (pi / 4) * (150 / 1000)^2, col = "grey", lty = 2)
points(150, (pi / 4) * (150 / 1000)^2, col = "red")

Coding Principles

First computer bug (1947).

Open: scripts/02-basics.R

Variable Names

Use meaningful variable names and use a consistent naming convention

  • flowdaily: All lowercase
  • flow.daily: Period-separated
  • flow_daily: Snake case
  • flowDaily: Camel case
  • FlowDaily: Upper camel case

Bugs

  • Debugging (step through code)
  • Typos (lintr)
  • Look at indentation
  • Error messages (copy and paste into search engine)
  • Help files
  • Vignettes (vignette())

Scripts and Projects

  • Working Directory
  • Open Script
  • Commenting
  • Indentation
  • Follow a style guide
  • Open the scripts/02-basics.R script and explore the content

Make this code more readable

  • Diameter: \(d=8m\)
  • Depth: \(d_1=3, d_2=0.9m\)
  • Flow \(Q=4 m^3/h\)
  • Detention time \(t\)

\[t = \frac{\pi}{4} D^2 \big( d_1 + \frac{d_2}{3} \big)\]

# Hopper Bottom Sedimentation Tank
D<-8
d1<-3
d2<-0.9
v=(pi/4)*D^2*(d1+(d2/3))
q<-4

(t<-v/q)
## [1] 41.46902

Improved version

# Sedimentation Tank
diameter <- 8
depth_1 <- 3
depth_2 <- 1
volume <- ((pi / 4) * diameter^2) * (depth_1 + (depth_2 / 3))
flow_rate <- 4

(detention_time <- volume / flow_rate)
## [1] 41.8879

R4H2O Cheat Sheet

PDF version

Case Study 0: Channel Flow

Discharge formula

\[Q = \frac{2}{3} C_d \sqrt{2g} \; bh^\frac{3}{2}\]

  • \(Q\): Discharge in m3/s
  • \(C_d\): Discharge coefficient
  • \(g\): Acceleration of gravity
  • \(b\): Crest length [m]
  • h: Head above crest [m]

Coding Practice

Create an R script and answer:

  1. What is the flow (\(Q\)) in the channel in m3/s when the height \(h = 100\) mm?
  2. What is the average flow for these three heights: 150mm, 136mm, 75mm, in litres per second?

\(Q = \frac{2}{3} C_d \sqrt{2g} \; b^\frac{3}{2}\)

  • \(C_d = 0.62\)
  • \(g = 9.81 m/s^2\)
  • \(b = 0.5 m\)

scripts/02-irrigation.R

Case Study 1: Water Quality

Base R functions

  • Basic programming functions
  • Arithmetic
  • Statistics
  • Plotting
  • Extensible with functions (packages)

R Packages

  • Packages to extend base functionality
    • Distributed mainly through CRAN
    • Comprehensive R Archive Network
  • Install with install.packages("dplyr")
  • Call new functions with package prefix: dplyr::filter()
  • Call library to access functions directly: library(dplyr)

Tidyverse

  • Collection of R packages
  • Easy data manipulation
  • ‘syntactic sugar’
  • Activate with library(tidyverse) or individual packages

Install the Tidyverse packages on your (cloud) computer

Obtaining Data

  • Database (SQL, Oracle)
  • Desktop files (spreadsheets, CSV)
  • Web (HTML, XML, JSON)
    • API
    • Scraping

Reading CSV files

  • readr package for CSV files (part of Tidyverse)
    • read_csv() faster alternative for read.csv()
    • Look at help text
  • File names in R:
    • Unix-based (slash instead of backslash)
# CSV Files
library(readr)
labdata <- read_csv("data/water_quality.csv")

# Reading Excel spreadsheets
labdata <- readxl::read_excel("data/water_quality.xlsx", 
                              skip = 2, sheet = "data")

Open scripts/03-data-frames.R

Variable Types

Variable Classes

  • Numeric (\(\mathbb{R}\))
  • Text ("abcd")
  • Dates ("2024-07-08")
  • Factors (classifications): ("Male", "Female", "Other")
  • Boolean (TRUE, FALSE)
  • Integer (\({\ldots -3, -2, -1, 0, 1, 2, 3, \ldots}\))
  • Complex numbers (\(a + bi\))

Conversion: as.numeric(), as.character, as.Date().

Scalar, vector and data frame / tibble (matrix)

Filtering

# Three methods

labdata[labdata$Measure == "Turbidity", ]

subset(labdata, Measure == "Turbidity")

library(dplyr)
filter(labdata, Measure == "Turbidity")

turbidity <- filter(labdata, Measure == "Turbidity" & Result > 5)

Descriptive Statistics

Measures of:

  1. Frequency (counting)
  2. Central tendency (mean, median, mode)
  3. Dispersion (range, variance)
  4. Position (percentiles)
  5. Shape (skew and kurtosis)

Open scripts/04-statistics.R

Measures of Dispersion (Standard Deviation)

R implements Bessel’s Correction

\[s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}}\]

sqrt(sum((turbidity$Result - mean(turbidity$Result))^2) / 
       (length(turbidity$Result) - 1))
## [1] 1.186115
sd(turbidity$Result)
## [1] 1.186115

Measure of Position (Quantiles and Percentiles)

  1. Place the observations in ascending order: \(y_1, y_2, \ldots y_n\).
  2. Calculate the rank (\(r\)) of the required percentile
  3. Interpolate between adjacent numbers: \[y_r = y_{\lfloor r \rfloor}+ (y_{r_{\lceil r \rceil}} - y_{\lfloor r \rfloor})(r - \lfloor r \rfloor)\]
    • \(y\): Observation
    • \(r\): Ranking number
    • \(\lfloor r \rfloor\): Floor or \(r\)
    • \(\lceil r \rceil\): Ceiling of \(r\)

Hyndman and Fan (1996) Sample Quantiles in Statistical Packages, The American Statistician.

Measures of Position (continued)

Grouped Analysis

  • Creates a grouped data frame
  • Loop evaluation over groups
  • vignette("dplyr")

Grouped Analysis

  • Creates a grouped data frame
  • Loop evaluation over groups
  • vignette("dplyr")

Solution

library(tidyverse)
# Bendigo weather station
bom <- read_csv("../data/IDCJAC0009_081123_1800_Data.csv")
bom_grouped <- group_by(bom, Year)
bom_annual <- summarise(bom_grouped, 
                        Rainfall = sum(`Rainfall amount (millimetres)`, 
                                       na.rm = TRUE))
slice_max(bom_annual, order_by = Rainfall, n = 5)
## # A tibble: 5 × 2
##    Year Rainfall
##   <dbl>    <dbl>
## 1  2010    1060.
## 2  2022     847.
## 3  1992     776 
## 4  2011     761 
## 5  1993     690.

Data Story-Telling

Ugly Data

Source: reddit.com/r/dataisugly/

Confusing graphics.

Which one is Cambodia?

Tell a Story With Data

  • Keep it simple
  • Add a point of interest (comparison, limits, trend etc.)
  • One story per graph

Data-Pixel Ratio

Visualising data with ggplot2

  • Part of the tidyverse
  • ggplot2.tidyverse.org/
  • Implements Grammar of Graphics

Grammar-of-Graphics

Open the 05-visualise.R script

Using Colours in Graphics

Two reasons to use colour:

  • Corporate or publisher house style
  • Communicate a value

Reasons not to use colour:

  • Aesthetics
  • Prettyfication

Coding Practice

  • Filter the laboratory data for THM in Merton and Southwold
  • Draw boxplot to visualise the distribution of results
  • Visualise limit:
    • geom_hline(yintercept = 0.25, col = "red"
  • Logarithmic scale
    • scale_y_log10()

Coding Practice

library(tidyverse)
labdata <- read_csv("../data/water_quality.csv") 
thm_merton_southwold <- filter(labdata, Measure == "THM" & 
                                 (Suburb == "Merton" | 
                                    Suburb == "Southwold"))
ggplot(thm_merton_southwold, aes(Suburb, Result)) + 
  geom_boxplot() + 
  scale_y_log10() + 
  geom_hline(yintercept = 0.25, col = "red")

Data Science Workflow

Reproducible and Replicable Analysis

Literate Programming

What You See is What You Mean

R Markdown

Open 06-chlorine-taste.Rmd.

Start Your Journey to Data Science

  1. Understand the basics
  2. Create simple programs
  3. Practice
  4. Ask for help (StackExchange, Reddit, #Rstats)
    • Minimal Working Example
  5. Build projects
  6. Help others

Coding Challenge

  • Go to the Bureau of Meteorology website
  • Download daily rainfall data for your favourite weather station
  • Determine the top five years with the highest total rainfall
  • Tips:
    • Variable names with spaces need to be between back ticks: `variable name`
    • The data has missing values. Use the na.rm = TRUE option in the sum() function
    • Use the slice_max() function to list the top five years