\(R^4H_2O\)

Day 1

08 July 2024

Introductions

Learning Objectives Level 1

Apply the principles of strategic data science to solve water problems
Write R code to load, analyse, and visualise data
Diagnose water quality data with descriptive statistics
Develop presentations, reports, and applications to share results

Course Structure

Case study approach using realistic synthetic data
Hands-on sessions
Based on generic principles

Program Level 1

Principles of Data Science
Basics of the R Language
Case Study: Channel flows
Case Study: Water quality
- Reading CSV and Spreadsheet data
- Descriptive Statistics
Visualise data with ggplot2
Data science workflow
Creating data products

Data Science Learning Curve

The steeper the learning curve, the bigger the reward.

What is Data Science?

Conway Venn Diagram (drewconway.com).

Many data scientists have published modifications of this model. Can you think of some other competencies essential for data scientists?

What is Useful Data Science?

What is Good Data Science?

Spreadsheets are Chaos

Source: lucidmanager.org

Code is Poetry: R

library(readr)
library(dplyr)
labdata <- read_csv("../data/water_quality.csv")
group_by(labdata, Measure) %>% 
    summarise(min = min(Result),
              mean = mean(Result),
              sd = sd(Result))

## # A tibble: 4 × 4
##   Measure           min    mean     sd
##   <chr>           <dbl>   <dbl>  <dbl>
## 1 Chlorine Total 0.025  0.499   0.499 
## 2 E. coli        0      0.00789 0.136 
## 3 THM            0.0005 0.0366  0.0978
## 4 Turbidity      0.05   0.360   1.03

Code is Poetry: Python

import pandas as pd
import numpy as np
labdata = pd.read_csv("../data/water_quality.csv")
 
labdata.groupby("Measure")["Result"].agg([np.min, np.mean, np.std])

##                    min      mean       std
## Measure                                   
## Chlorine Total  0.0250  0.498967  0.498730
## E. coli         0.0000  0.007895  0.135584
## THM             0.0005  0.036565  0.097821
## Turbidity       0.0500  0.360327  1.025126

Code is Poetry: Julia

using DataFrames, CSV, StatsBase
labdata = DataFrame(CSV.File("../data/water_quality.csv"));

combine(groupby(labdata, :Measure),
        :Result => minimum,
        :Result => mean,
        :Result => std)

## 4×4 DataFrame
##  Row │ Measure         Result_minimum  Result_mean  Result_std
##      │ String15        Float64         Float64      Float64
## ─────┼─────────────────────────────────────────────────────────
##    1 │ Chlorine Total          0.025    0.498967     0.49873
##    2 │ E. coli                 0.0      0.00789474   0.135584
##    3 │ Turbidity               0.05     0.360327     1.02513
##    4 │ THM                     0.0005   0.0365647    0.0978211

Basics of the R Language

R Studio screenshot.

Register at posit.cloud
New Project From Github Repo
Enter: https://github.com/pprevos/r4h2o/
RStudio will download the project files

Arithmetic

Console (REPL: Read-Eval-Print Loop)
Enter lines below

diameter <- 150
pipe_area <- (pi / 4) * (diameter / 1000)^2
pipe_area

## [1] 0.01767146

sqrt(pipe_area / (pi / 4)) * 1000

## [1] 150

Internet meme.

6 / 2 * (1 + 2)

## [1] 9

REPL and Scripting

Vectors

complaints <- c(12, 13, 23, 45, 22, 99, 31)

sum(complaints)
prod(complaints)
abs(complaints)
exp(complaints)
factorial(complaints)
log(complaints, base = 3)
log10(complaints)

Basic Plotting 1

diameter <- 50:350
pipe_area <- (pi / 4) * (diameter / 1000)^2
par(mar = c(4, 4, 1, 1))
plot(diameter, pipe_area, type = "l", col = "blue")
abline(v = 150, col = "grey", lty = 2)
abline(h = (pi / 4) * (150 / 1000)^2, col = "grey", lty = 2)
points(150, (pi / 4) * (150 / 1000)^2, col = "red")

Coding Principles

First computer bug (1947).

Open: scripts/02-basics.R

Variable Names

Use meaningful variable names and use a consistent naming convention

flowdaily: All lowercase
flow.daily: Period-separated
flow_daily: Snake case
flowDaily: Camel case
FlowDaily: Upper camel case

Bugs

Debugging (step through code)
Typos (lintr)
Look at indentation
Error messages (copy and paste into search engine)
Help files
Vignettes (vignette())

Scripts and Projects

Working Directory
Open Script
Commenting
Indentation
Follow a style guide
Open the scripts/02-basics.R script and explore the content

Make this code more readable

Diameter: \(d=8m\)
Depth: \(d_1=3, d_2=0.9m\)
Flow \(Q=4 m^3/h\)
Detention time \(t\)

\[t = \frac{\pi}{4} D^2 \big( d_1 + \frac{d_2}{3} \big)\]

# Hopper Bottom Sedimentation Tank
D<-8
d1<-3
d2<-0.9
v=(pi/4)*D^2*(d1+(d2/3))
q<-4

(t<-v/q)

## [1] 41.46902

Improved version

# Sedimentation Tank
diameter <- 8
depth_1 <- 3
depth_2 <- 1
volume <- ((pi / 4) * diameter^2) * (depth_1 + (depth_2 / 3))
flow_rate <- 4

(detention_time <- volume / flow_rate)

## [1] 41.8879

R4H2O Cheat Sheet

PDF version

Case Study 0: Channel Flow

Discharge formula

\[Q = \frac{2}{3} C_d \sqrt{2g} \; bh^\frac{3}{2}\]

\(Q\): Discharge in m³/s
\(C_d\): Discharge coefficient
\(g\): Acceleration of gravity
\(b\): Crest length [m]
h: Head above crest [m]

Coding Practice

Create an R script and answer:

What is the flow (\(Q\)) in the channel in m³/s when the height \(h = 100\) mm?
What is the average flow for these three heights: 150mm, 136mm, 75mm, in litres per second?

\(Q = \frac{2}{3} C_d \sqrt{2g} \; b^\frac{3}{2}\)

\(C_d = 0.62\)
\(g = 9.81 m/s^2\)
\(b = 0.5 m\)

scripts/02-irrigation.R

Case Study 1: Water Quality

Base R functions

Basic programming functions
Arithmetic
Statistics
Plotting
Extensible with functions (packages)

R Packages

Packages to extend base functionality
- Distributed mainly through CRAN
- Comprehensive R Archive Network
Install with install.packages("dplyr")
Call new functions with package prefix: dplyr::filter()
Call library to access functions directly: library(dplyr)

Tidyverse

Collection of R packages
Easy data manipulation
‘syntactic sugar’
Activate with library(tidyverse) or individual packages

Install the Tidyverse packages on your (cloud) computer

Obtaining Data

Database (SQL, Oracle)
Desktop files (spreadsheets, CSV)
Web (HTML, XML, JSON)
- API
- Scraping

Reading CSV files

readr package for CSV files (part of Tidyverse)
- read_csv() faster alternative for read.csv()
- Look at help text
File names in R:
- Unix-based (slash instead of backslash)

# CSV Files
library(readr)
labdata <- read_csv("data/water_quality.csv")

# Reading Excel spreadsheets
labdata <- readxl::read_excel("data/water_quality.xlsx", 
                              skip = 2, sheet = "data")

Open scripts/03-data-frames.R

Variable Types

Variable Classes

Numeric (\(\mathbb{R}\))
Text ("abcd")
Dates ("2024-07-08")
Factors (classifications): ("Male", "Female", "Other")
Boolean (TRUE, FALSE)
Integer (\({\ldots -3, -2, -1, 0, 1, 2, 3, \ldots}\))
Complex numbers (\(a + bi\))

Conversion: as.numeric(), as.character, as.Date().

Scalar, vector and data frame / tibble (matrix)

Filtering

# Three methods

labdata[labdata$Measure == "Turbidity", ]

subset(labdata, Measure == "Turbidity")

library(dplyr)
filter(labdata, Measure == "Turbidity")

turbidity <- filter(labdata, Measure == "Turbidity" & Result > 5)

Descriptive Statistics

Measures of:

Frequency (counting)
Central tendency (mean, median, mode)
Dispersion (range, variance)
Position (percentiles)
Shape (skew and kurtosis)

Open scripts/04-statistics.R

Measures of Dispersion (Standard Deviation)

R implements Bessel’s Correction

\[s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}}\]

sqrt(sum((turbidity$Result - mean(turbidity$Result))^2) / 
       (length(turbidity$Result) - 1))

## [1] 1.186115

sd(turbidity$Result)

## [1] 1.186115

Measure of Position (Quantiles and Percentiles)

Place the observations in ascending order: \(y_1, y_2, \ldots y_n\).
Calculate the rank (\(r\)) of the required percentile
Interpolate between adjacent numbers: \[y_r = y_{\lfloor r \rfloor}+ (y_{r_{\lceil r \rceil}} - y_{\lfloor r \rfloor})(r - \lfloor r \rfloor)\]
- \(y\): Observation
- \(r\): Ranking number
- \(\lfloor r \rfloor\): Floor or \(r\)
- \(\lceil r \rceil\): Ceiling of \(r\)

Hyndman and Fan (1996) Sample Quantiles in Statistical Packages, The American Statistician.

Measures of Position (continued)

Grouped Analysis

Creates a grouped data frame
Loop evaluation over groups
vignette("dplyr")

Grouped Analysis

Creates a grouped data frame
Loop evaluation over groups
vignette("dplyr")

Solution

library(tidyverse)
# Bendigo weather station
bom <- read_csv("../data/IDCJAC0009_081123_1800_Data.csv")
bom_grouped <- group_by(bom, Year)
bom_annual <- summarise(bom_grouped, 
                        Rainfall = sum(`Rainfall amount (millimetres)`, 
                                       na.rm = TRUE))
slice_max(bom_annual, order_by = Rainfall, n = 5)

## # A tibble: 5 × 2
##    Year Rainfall
##   <dbl>    <dbl>
## 1  2010    1060.
## 2  2022     847.
## 3  1992     776 
## 4  2011     761 
## 5  1993     690.

Data Story-Telling

Ugly Data

Source: reddit.com/r/dataisugly/

Confusing graphics.

Which one is Cambodia?

Tell a Story With Data

Keep it simple
Add a point of interest (comparison, limits, trend etc.)
One story per graph

Data-Pixel Ratio

Visualising data with ggplot2

Part of the tidyverse
ggplot2.tidyverse.org/
Implements Grammar of Graphics

Grammar-of-Graphics

Open the 05-visualise.R script

Using Colours in Graphics

Two reasons to use colour:

Corporate or publisher house style
Communicate a value

Reasons not to use colour:

Aesthetics
Prettyfication

Coding Practice

Filter the laboratory data for THM in Merton and Southwold
Draw boxplot to visualise the distribution of results
Visualise limit:
- geom_hline(yintercept = 0.25, col = "red"
Logarithmic scale
- scale_y_log10()

Coding Practice

library(tidyverse)
labdata <- read_csv("../data/water_quality.csv") 
thm_merton_southwold <- filter(labdata, Measure == "THM" & 
                                 (Suburb == "Merton" | 
                                    Suburb == "Southwold"))
ggplot(thm_merton_southwold, aes(Suburb, Result)) + 
  geom_boxplot() + 
  scale_y_log10() + 
  geom_hline(yintercept = 0.25, col = "red")

Data Science Workflow

Reproducible and Replicable Analysis

Literate Programming

What You See is What You Mean

R Markdown

Open 06-chlorine-taste.Rmd.

Start Your Journey to Data Science

Understand the basics
Create simple programs
Practice
Ask for help (StackExchange, Reddit, #Rstats)
- Minimal Working Example
Build projects
Help others

Coding Challenge

Go to the Bureau of Meteorology website
- bom.gov.au/climate/data
Download daily rainfall data for your favourite weather station
Determine the top five years with the highest total rainfall
Tips:
- Variable names with spaces need to be between back ticks: `variable name`
- The data has missing values. Use the na.rm = TRUE option in the sum() function
- Use the slice_max() function to list the top five years