The steeper the learning curve, the bigger the reward.
Conway Venn Diagram (drewconway.com).
Many data scientists have published modifications of this model. Can you think of some other competencies essential for data scientists?
Source: lucidmanager.org
library(readr)
library(dplyr)
labdata <- read_csv("../data/water_quality.csv")
group_by(labdata, Measure) %>%
  summarise(min = min(Result),
            mean = mean(Result),
            sd = sd(Result))
## # A tibble: 4 × 4
##   Measure           min    mean     sd
##   <chr>           <dbl>   <dbl>  <dbl>
## 1 Chlorine Total 0.025  0.499   0.499
## 2 E. coli        0      0.00789 0.136
## 3 THM            0.0005 0.0366  0.0978
## 4 Turbidity      0.05   0.360   1.03
import pandas as pd
labdata = pd.read_csv("../data/water_quality.csv")
# String names select the pandas aggregations (std divides by n - 1, like R's sd())
labdata.groupby("Measure")["Result"].agg(["min", "mean", "std"])
##                    min      mean       std
## Measure
## Chlorine Total  0.0250  0.498967  0.498730
## E. coli         0.0000  0.007895  0.135584
## THM             0.0005  0.036565  0.097821
## Turbidity       0.0500  0.360327  1.025126
using DataFrames, CSV, StatsBase
labdata = DataFrame(CSV.File("../data/water_quality.csv"));
combine(groupby(labdata, :Measure),
        :Result => minimum,
        :Result => mean,
        :Result => std)
## 4×4 DataFrame
##  Row │ Measure         Result_minimum  Result_mean  Result_std
##      │ String15        Float64         Float64      Float64
## ─────┼─────────────────────────────────────────────────────────
##    1 │ Chlorine Total          0.025    0.498967     0.49873
##    2 │ E. coli                 0.0      0.00789474   0.135584
##    3 │ Turbidity               0.05     0.360327     1.02513
##    4 │ THM                     0.0005   0.0365647    0.0978211
RStudio screenshot.
https://github.com/pprevos/r4h2o/
## [1] 0.01767146
## [1] 150
diameter <- 50:350
pipe_area <- (pi / 4) * (diameter / 1000)^2
par(mar = c(4, 4, 1, 1))
plot(diameter, pipe_area, type = "l", col = "blue")
abline(v = 150, col = "grey", lty = 2)
abline(h = (pi / 4) * (150 / 1000)^2, col = "grey", lty = 2)
points(150, (pi / 4) * (150 / 1000)^2, col = "red")
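The point highlighted on the plot can be checked with plain arithmetic. This is a minimal sketch: the first step reproduces the area of a 150 mm pipe, and the second step (inverting the formula to recover the diameter) is only an assumption about how a value of 150 could be obtained. The names `diameter_mm` and `diameter_check` are illustrative.

```r
# Cross-sectional area of a 150 mm pipe in square metres
diameter_mm <- 150
pipe_area <- (pi / 4) * (diameter_mm / 1000)^2
print(pipe_area)

# Invert the formula to recover the diameter in millimetres
diameter_check <- sqrt(4 * pipe_area / pi) * 1000
print(diameter_check)
```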
First computer bug (1947).
Open: scripts/02-basics.R
Use meaningful variable names and use a consistent naming convention
flowdaily: All lowercase
flow.daily: Period-separated
flow_daily: Snake case
flowDaily: Camel case
FlowDaily: Upper camel case
vignette()
Open the scripts/02-basics.R script and explore the content
# Sedimentation Tank
diameter <- 8
depth_1 <- 3
depth_2 <- 1
volume <- ((pi / 4) * diameter^2) * (depth_1 + (depth_2 / 3))
flow_rate <- 4
(detention_time <- volume / flow_rate)
## [1] 41.8879
\[Q = \frac{2}{3} C_d \sqrt{2g} \; bh^\frac{3}{2}\]
Create an R script and answer:
\(Q = \frac{2}{3} C_d \sqrt{2g} \; bh^\frac{3}{2}\)
scripts/02-irrigation.R
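The weir formula translates almost directly into R. The sketch below wraps it in a function; the function name `weir_flow` and the input values (discharge coefficient 0.6, width 0.5 m, head 0.1 m) are hypothetical and chosen only to illustrate the calculation, not taken from the exercise.

```r
# Flow over a rectangular weir: Q = (2/3) * Cd * sqrt(2g) * b * h^(3/2)
# cd: discharge coefficient (dimensionless)
# b:  width of the weir (m)
# h:  water head above the weir crest (m)
weir_flow <- function(cd, b, h, g = 9.81) {
  (2 / 3) * cd * sqrt(2 * g) * b * h^(3 / 2)
}

# Hypothetical inputs, for illustration only
q <- weir_flow(cd = 0.6, b = 0.5, h = 0.1)
print(q)  # flow in cubic metres per second, roughly 0.028

# The function is vectorised, so several heads can be evaluated at once
weir_flow(cd = 0.6, b = 0.5, h = c(0.05, 0.10, 0.15))
```

Because the head enters with an exponent of 3/2, doubling the head increases the flow by a factor of about 2.8 rather than 2.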
install.packages("dplyr")
dplyr::filter()
library(dplyr)
library(tidyverse) or individual packages
Install the Tidyverse packages on your (cloud) computer
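A short sketch of the difference between calling a function through its namespace and attaching the whole package. It assumes dplyr is already installed and uses the built-in `mtcars` data set purely as a stand-in; the object names are illustrative.

```r
# install.packages("dplyr")  # one-off installation, left commented out here

# Call a single function through its namespace without attaching the package
six_cyl <- dplyr::filter(mtcars, cyl == 6)

# Attach the package so its functions can be called by name.
# After this, filter() refers to dplyr::filter() and masks stats::filter().
library(dplyr)
eight_cyl <- filter(mtcars, cyl == 8)

nrow(six_cyl)
nrow(eight_cyl)
```

The `package::function()` form is handy in short scripts that need only one or two functions, while `library()` keeps longer scripts readable.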
readr package for CSV files (part of Tidyverse)
read_csv(): faster alternative for read.csv()
# CSV Files
library(readr)
labdata <- read_csv("data/water_quality.csv")
# Reading Excel spreadsheets
labdata <- readxl::read_excel("data/water_quality.xlsx",
                              skip = 2, sheet = "data")
Open scripts/03-data-frames.R
Character ("abcd")
Date ("2024-07-08")
Factor ("Male", "Female", "Other")
Logical (TRUE, FALSE)
Conversion: as.numeric(), as.character(), as.Date().
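The conversion functions can be tried on a few throwaway values. This is a minimal sketch; all values and object names are made up for illustration.

```r
# Character to number
num <- as.numeric("3.14")
class(num)

# Number to character
txt <- as.character(42)
class(txt)

# Character to date (ISO 8601 format is parsed without a format string)
sample_date <- as.Date("2024-07-08")
class(sample_date)

# Character vector to factor; levels are sorted alphabetically by default
gender <- factor(c("Male", "Female", "Other", "Female"))
levels(gender)

# Text that is not a number converts to NA (R also issues a warning)
not_a_number <- suppressWarnings(as.numeric("abcd"))
is.na(not_a_number)
```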
Scalar, vector and data frame / tibble (matrix)
Measures of:
Open scripts/04-statistics.R
R implements Bessel’s Correction
\[s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}}\]
## [1] 1.186115
## [1] 1.186115
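Two identical results like these are what you get when the built-in `sd()` is compared with the formula written out by hand. The sketch below shows that comparison on a small made-up vector (not the workshop data), and adds the uncorrected version that divides by n for contrast.

```r
# Hypothetical sample, for illustration only
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)
n <- length(x)

# Built-in standard deviation (divides by n - 1)
sd_builtin <- sd(x)

# The same formula written out by hand
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Without Bessel's correction (divides by n), the result is smaller
sd_population <- sqrt(sum((x - mean(x))^2) / n)

print(c(sd_builtin, sd_manual, sd_population))
```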
Hyndman and Fan (1996) Sample Quantiles in Statistical Packages, The American Statistician.
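R's `quantile()` function implements the nine sample quantile definitions catalogued by Hyndman and Fan through its `type` argument, with type 7 as the default. A minimal sketch on made-up data shows that the choice of type can change the answer:

```r
# Hypothetical data: the integers 1 to 10
x <- 1:10

# Default method (type 7)
q7 <- quantile(x, probs = 0.25, type = 7)

# Type 6, a definition used by some other statistical packages
q6 <- quantile(x, probs = 0.25, type = 6)

print(c(type7 = unname(q7), type6 = unname(q6)))
```

With small samples the difference between methods can be noticeable, so it is worth stating which type was used when reporting percentiles.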
vignette("dplyr")
library(tidyverse)
# Bendigo weather station
bom <- read_csv("../data/IDCJAC0009_081123_1800_Data.csv")
bom_grouped <- group_by(bom, Year)
bom_annual <- summarise(bom_grouped,
                        Rainfall = sum(`Rainfall amount (millimetres)`,
                                       na.rm = TRUE))
slice_max(bom_annual, order_by = Rainfall, n = 5)
## # A tibble: 5 × 2
##    Year Rainfall
##   <dbl>    <dbl>
## 1  2010    1060.
## 2  2022     847.
## 3  1992     776
## 4  2011     761
## 5  1993     690.
Confusing graphics.
Which one is Cambodia?
ggplot2.tidyverse.org/
Open the 05-visualise.R script
Two reasons to use colour:
Reasons not to use colour:
geom_hline(yintercept = 0.25, col = "red")
scale_y_log10()
library(tidyverse)
labdata <- read_csv("../data/water_quality.csv")
thm_merton_southwold <- filter(labdata, Measure == "THM" &
                                 (Suburb == "Merton" |
                                  Suburb == "Southwold"))
ggplot(thm_merton_southwold, aes(Suburb, Result)) +
  geom_boxplot() +
  scale_y_log10() +
  geom_hline(yintercept = 0.25, col = "red")
Open 06-chlorine-taste.Rmd.
`variable name`
na.rm = TRUE option in the sum() function
slice_max() function to list the top five years
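These three points can be seen together on a toy data set. The sketch below assumes dplyr is installed; the `rain` table and its values are invented for illustration and only borrow the column name from the weather station file.

```r
library(dplyr)

# Hypothetical toy data. The column name contains spaces and brackets,
# so it must be wrapped in backticks whenever it is used in code.
rain <- tibble(
  Year = c(2020, 2020, 2021, 2021, 2022),
  `Rainfall amount (millimetres)` = c(10.2, NA, 5.0, 7.5, 12.1)
)

# na.rm = TRUE drops the missing reading; without it the 2020 total would be NA
annual <- rain %>%
  group_by(Year) %>%
  summarise(Rainfall = sum(`Rainfall amount (millimetres)`, na.rm = TRUE))

# slice_max() keeps the n rows with the largest values, sorted from high to low
top_years <- slice_max(annual, order_by = Rainfall, n = 2)
print(top_years)
```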