library(tidyverse)
library(rio)
HW 2: R practice
Due Sunday, September 29th at 11:59pm
This assignment is to be submitted as a PDF file into Canvas. All questions are worth 5 points each, or as marked. The preferred option is to render this Quarto document to PDF and submit that file. This file is set up to render to PDF by default. If you have trouble with that, you can also render to HTML and print to PDF from your browser. (Change `typst` in the header lines above to `html`.)
You will fill in the blanks (indicated by `______`) in the code chunks below. You will also need to write some code from scratch in a few places. Ensure that you have removed all the underscores, otherwise you will get errors. Once you’re done filling in the blanks, you should be able to run all the code chunks without errors. To make the code chunks run, you will need to change the `eval` option from `false` to `true` in the execute options at the top of the document.
You should install any required packages on your own, outside this document, so the code for installing packages is not included here. (You should never include code for installing packages in a Quarto document.) If you run into trouble, please reach out to me for help.
The following dataset provides water quality measurements from various beaches in Sydney, Australia, for the period September 1991 to April 2025.
This data was used for a Tidy Tuesday competition. You can get a more detailed description here. The inspiration behind using this data for the competition was this article.
water_quality <- rio::import('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/water_quality.csv')
Data dictionary
variable | class | description |
---|---|---|
region | character | Area of Sydney City |
council | character | City council responsible for water quality |
swim_site | character | Name of beach/swimming location |
date | date | Date |
time | time | Time of day |
enterococci_cfu_100ml | integer | Enterococci bacteria levels in colony forming units (CFU) per 100 millilitres of water |
water_temperature_c | integer | Water temperature in degrees Celsius |
conductivity_ms_cm | integer | Conductivity in microsiemens per centimetre |
latitude | double | Latitude |
longitude | double | Longitude |
We will start exploring this dataset using R.
Structure of the dataset
1a. How many rows and columns are in the dataset? (5)
dim(water_quality)
[1] 123530 10
1b. What are the variable names in the dataset? (5)
colnames(water_quality)
1c. What are the data types of each variable? (5)
You can use the `glimpse` function from the dplyr package to get a quick overview of the data types of each variable. The dplyr package is part of the tidyverse collection of packages, which we loaded at the start of this assignment.
glimpse(water_quality)
Rows: 123,530
Columns: 10
$ region <chr> "Western Sydney", "Sydney Harbour", "Sydney Harb…
$ council <chr> "Hawkesbury City Council", "North Sydney Council…
$ swim_site <chr> "Windsor Beach", "Hayes Street Beach", "Northbri…
$ date <IDate> 2025-04-28, 2025-04-28, 2025-04-28, 2025-04-28…
$ time <chr> "11:00:00", "11:40:00", "10:54:00", "09:28:00", …
$ enterococci_cfu_100ml <int> 620, 64, 160, 54, 720, 230, 120, 280, 60, 100, 1…
$ water_temperature_c <int> 20, 21, 21, 21, 18, 21, 21, 21, 22, 22, 20, 20, …
$ conductivity_ms_cm <int> 248, 45250, 48930, 52700, 64, 39140, 4845, 50600…
$ latitude <dbl> -33.60448, -33.84172, -33.80604, -33.80073, -33.…
$ longitude <dbl> 150.8170, 151.2194, 151.2228, 151.2748, 150.6979…
Data munging (cleaning and transforming the data)
2a. Are there any missing values in the dataset? If so, which variables have missing values and how many? (5)
Hint: The `is.na` function returns `TRUE` if a value is missing (NA) and `FALSE` otherwise. In R, `TRUE` and `FALSE` are treated as 1 and 0 respectively, so you can use the `sum` function to count the number of missing values in a variable.
Choose one variable in the dataset and find the number of missing values it has.
sum(is.na(water_quality$water_temperature_c))
[1] 75039
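To see why summing the result of `is.na` gives a count, here is a minimal sketch on a made-up vector (the values are illustrative only and are not taken from `water_quality`). It also shows that `mean` gives the proportion of missing values, by the same logic.

```r
# Made-up vector for illustration; not taken from water_quality
x <- c(20, NA, 22, NA, 19)

missing_flags <- is.na(x)            # FALSE TRUE FALSE TRUE FALSE
n_missing <- sum(missing_flags)      # TRUE counts as 1, FALSE as 0
prop_missing <- mean(missing_flags)  # share of values that are missing

print(n_missing)     # 2
print(prop_missing)  # 0.4
```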
You can use the `summarise` function from the dplyr package to count the number of missing values in each variable in the dataset. The `across` function allows you to apply a function to multiple columns at once. You will also need to create an anonymous function with one argument (the column's values) that returns the number of missing values in that variable. The formal way to do this is `\(x) sum(is.na(x))`.
water_quality |>
  summarise(across(everything(), \(x) sum(is.na(x))))
region council swim_site date time enterococci_cfu_100ml water_temperature_c
1 0 0 0 0 11192 307 75039
conductivity_ms_cm latitude longitude
1 78536 0 0
Read the documentation for `summarise` and `across` to understand how they work. You can access the documentation by running `?summarise` and `?across` in your R console.
2b. Create a new variable called `log_enterococci` that is the natural logarithm of the `enterococci_cfu_100ml` variable. (5)
water_quality <- water_quality |>
  mutate(log_enterococci = log(enterococci_cfu_100ml + 1)) # add 1 to avoid the log(0) situation
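As a quick illustration of why 1 is added before taking the logarithm, the sketch below uses a made-up vector of counts (not values from the dataset). It also shows base R's `log1p`, which computes `log(1 + x)` and could be used as an alternative to the expression above.

```r
# Made-up counts for illustration only
counts <- c(0, 1, 64, 620)

plain_log   <- log(counts)      # first element is -Inf, since log(0) is not finite
shifted_log <- log(counts + 1)  # first element is 0
log1p_vals  <- log1p(counts)    # base R shortcut for log(1 + x)

print(plain_log[1])    # -Inf
print(shifted_log[1])  # 0
```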
2c. Create a new variable called `month` that is the month of the year extracted from the `date` variable. We want a textual representation of the month (e.g., “January”, “February”, etc.). (5)
See the documentation for `lubridate::month` for how to do this. You will need to change one of the default options in the `lubridate::month` function. The lubridate package is part of the tidyverse collection of packages, so you don’t need to load it separately.
water_quality <- water_quality |>
  mutate(month = lubridate::month(date, label = TRUE, abbr = FALSE))
2d. The maximum recorded water temperature is 1040, which seems unreasonably high. The question is whether measurements at particular beaches are being done poorly. Extract observations where the water temperature is above 35 degrees Celsius, and then create a frequency table of the beaches in this filtered dataset. (5)
You can use the `filter` function from the dplyr package to filter the dataset, and then use the `count` function to create a frequency table of the `swim_site` variable in the filtered dataset.
water_quality |>
  filter(water_temperature_c > 35) |>
  count(swim_site)
swim_site n
1 Bilarong Reserve 1
2 Bondi Beach 2
3 Bronte Beach 1
4 Camp Cove 1
5 Clovelly Beach 1
6 Coogee Beach 2
7 Darling Harbour 1
8 Edwards Beach 1
9 Gordons Bay (East) 1
10 Little Bay Beach 1
11 Malabar Beach 1
12 Maroubra Beach 1
13 Narrabeen Lagoon (Birdwood Park) 1
14 Newport Beach 1
15 North Cronulla Beach 1
16 North Narrabeen Beach 1
17 South Maroubra Beach 1
18 South Maroubra Rockpool 1
19 Tamarama Beach 1
20 Wanda Beach 1
21 Warriewood Beach 1
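Note that rows with a missing temperature never appear in a table like this one, because a comparison with NA returns NA rather than TRUE, and `filter` keeps only rows where the condition is TRUE. The base R sketch below mimics that behaviour on a made-up vector of readings; `which()` stands in for the NA-dropping step.

```r
# Made-up readings for illustration only
temps <- c(20, NA, 1040, 36, 21)

above_35 <- temps > 35          # FALSE NA TRUE TRUE FALSE
kept <- temps[which(above_35)]  # which() skips NA, so only TRUE positions are kept

print(kept)  # 1040 36
```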
2e. Were the high temperature readings more common in particular years? (5)
water_quality <- water_quality |>
  mutate(year = lubridate::year(date))
water_quality |>
  filter(water_temperature_c > 35) |>
  count(year)
year n
1 2014 1
2 2015 13
3 2017 2
4 2018 1
5 2019 4
6 2020 2
Data summaries
3a. How many unique swim sites are in each region? (5)
Overall, there are 79 unique swim sites in the dataset. Use the `group_by` and `summarise` functions from the dplyr package to find the number of unique swim sites in each region. We will use a pipe (`|>`) to pass the `water_quality` dataset to the `group_by` function, and then pass the result to the `summarise` function.
water_quality |>
  group_by(region) |>
  summarise(num_swim_sites = n_distinct(swim_site))
# A tibble: 5 × 2
region num_swim_sites
<chr> <int>
1 Northern Sydney 22
2 Southern Sydney 8
3 Sydney City 11
4 Sydney Harbour 31
5 Western Sydney 7
3b. What is the average water temperature in each region per month, after filtering out readings > 35C? (5)
water_quality |>
  filter(water_temperature_c < 35) |> # drop readings of 35C or above (rows with NA are dropped too)
  group_by(region, month) |>
  summarise(avg_water_temp = mean(water_temperature_c, na.rm = TRUE)) |>
  ungroup() # remove the grouping so the result is usable for other operations
# A tibble: 60 × 3
region month avg_water_temp
<chr> <ord> <dbl>
1 Northern Sydney January 21.7
2 Northern Sydney February 22.3
3 Northern Sydney March 22.0
4 Northern Sydney April 20.8
5 Northern Sydney May 18.9
6 Northern Sydney June 17.2
7 Northern Sydney July 16.2
8 Northern Sydney August 16.3
9 Northern Sydney September 17.3
10 Northern Sydney October 18.5
# ℹ 50 more rows
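As a side note on removing the grouping after a summary: `summarise` also accepts a `.groups` argument, and `.groups = "drop"` returns an ungrouped result in one step. The sketch below assumes dplyr 1.0.0 or later and uses a small made-up data frame (`toy` and its columns are hypothetical names, not part of the assignment data).

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(
  region = c("A", "A", "B", "B"),
  month  = c("January", "February", "January", "January"),
  temp   = c(21, 22, 20, 18)
)

# Drop the grouping inside summarise() instead of calling ungroup() afterwards
by_drop <- toy |>
  group_by(region, month) |>
  summarise(avg_temp = mean(temp), .groups = "drop")

print(is_grouped_df(by_drop))  # FALSE
```

Either approach leaves an ungrouped result, so which one to use is a matter of taste.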
Extra credit (10)
Using the water quality data filtered for water temperatures ≤ 35C, recreate the following plot. Even if you can’t code them in R yet, write out what data steps, and in what order, you would need to get the data in shape to be able to draw these plots.
This exercise is meant to train you to (a) reverse-engineer graphs, and (b) conceptualize the data munging steps you need to get to these visualizations.
First, I filtered out water temperatures greater than 35C, which also removes observations with NA, and created a new filtered data frame to use. From there, I used the group_by and summarise functions to get the mean water temperature for every month and year in the dataset. This allowed me to use ggplot and facet_wrap to create a graphic showing the mean monthly temperature across years, with one panel per month. I made my lines red; otherwise it would look identical to yours.
# Filter out readings above 35C and store the result in a new data frame
water_quality_filtered <- water_quality |>
  filter(water_temperature_c <= 35)
# Compute the mean temperature for each year and month, which gives much less crowded graphs
water_quality_summary <- water_quality_filtered |>
  group_by(year, month) |>
  summarise(mean_temp = mean(water_temperature_c, na.rm = TRUE))
ggplot(water_quality_summary, aes(x = year, y = mean_temp)) +
  geom_line(color = "red") +
  facet_wrap(~month, nrow = 3) +
  ylab("Average monthly water temperature (C)") +
  xlab("")