HW 2: R practice

Due Sunday, September 29th at 11:59pm

Author

Aidan Perkins

Instructions

This assignment is to be submitted as a PDF file into Canvas. All questions are worth 5 points each, or as marked. The preferred option is to render this Quarto document to PDF and submit that file. This file is set up to render to PDF by default. If you have trouble with that, you can also render to HTML and print to PDF from your browser. (Change typst in the header lines above to html.)

You will fill in the blanks (indicated by ______) in the code chunks below. You will also need to write some code from scratch in a few places. Ensure that you have removed all the underscores, otherwise you will get errors. Once you’re done filling in the blanks, you should be able to run all the code chunks without errors. To make the code chunks run, you will need to change the eval option from false to true in the execute options at the top of the document.

You should install any required packages on your own. If you run into trouble, please reach out to me for help. You should do this outside this document, so the code for installing packages is not included here (You should never include code for installing packages in a Quarto document).

library(tidyverse)
library(rio)

The following dataset provides water quality measurements from various beaches in Sydney, Australia, for the period September 1991 to April 2025.

This data was used for a Tidy Tuesday competition. You can get a more detailed description here. The inspiration behind using this data for the competition was this article.

water_quality = rio::import('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/water_quality.csv')

Data dictionary

variable	class	description
region	character	Area of Sydney City
council	character	City council responsible for water quality
swim_site	character	Name of beach/swimming location
date	date	Date
time	time	Time of day
enterococci_cfu_100ml	integer	Enterococci bacteria levels in colony forming units (CFU) per 100 millilitres of water
water_temperature_c	integer	Water temperature in degrees Celsius
conductivity_ms_cm	integer	Conductivity in microsiemens per centimetre
latitude	double	Latitude
longitude	double	Longitude

We will start exploring this dataset using R.

Structure of the dataset

1a. How many rows and columns are in the dataset? (5)

dim(water_quality)

[1] 123530     10
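
As a small aside, nrow() and ncol() return the two components of dim() separately. The sketch below uses a made-up data frame (toy_df is hypothetical, not the assignment data) to show how the three calls relate.

```r
# toy_df is a made-up data frame used only for illustration
toy_df <- data.frame(site = c("A", "B", "C"), temp = c(20, 21, 19))

dim(toy_df)   # rows and columns together
nrow(toy_df)  # number of rows only
ncol(toy_df)  # number of columns only
```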

1b. What are the variable names in the dataset? (5)

colnames(water_quality)

1c. What are the data types of each variable? (5)

You can use the glimpse function from the dplyr package to get a quick overview of the data types of each variable. The dplyr package is part of the tidyverse collection of packages, which we loaded at the start of this assignment.

glimpse(water_quality)

Rows: 123,530
Columns: 10
$ region                <chr> "Western Sydney", "Sydney Harbour", "Sydney Harb…
$ council               <chr> "Hawkesbury City Council", "North Sydney Council…
$ swim_site             <chr> "Windsor Beach", "Hayes Street Beach", "Northbri…
$ date                  <IDate> 2025-04-28, 2025-04-28, 2025-04-28, 2025-04-28…
$ time                  <chr> "11:00:00", "11:40:00", "10:54:00", "09:28:00", …
$ enterococci_cfu_100ml <int> 620, 64, 160, 54, 720, 230, 120, 280, 60, 100, 1…
$ water_temperature_c   <int> 20, 21, 21, 21, 18, 21, 21, 21, 22, 22, 20, 20, …
$ conductivity_ms_cm    <int> 248, 45250, 48930, 52700, 64, 39140, 4845, 50600…
$ latitude              <dbl> -33.60448, -33.84172, -33.80604, -33.80073, -33.…
$ longitude             <dbl> 150.8170, 151.2194, 151.2228, 151.2748, 150.6979…

Data munging (cleaning and transforming the data)

2a. Are there any missing values in the dataset? If so, which variables have missing values and how many? (5)

Hint: The is.na function returns a TRUE if a value is missing (NA) and FALSE otherwise. In R, TRUE and FALSE are treated as 1 and 0 respectively, so you can use the sum function to count the number of missing values in a variable.
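
To make the hint concrete, here is a minimal sketch on a made-up vector (temps is hypothetical and not taken from the dataset). It shows the logical vector that is.na returns and how summing or averaging it gives a count or a proportion of missing values.

```r
# temps is a made-up vector used only for illustration
temps <- c(20, NA, 21, NA, 19)

is.na(temps)        # logical vector, TRUE where the value is missing
sum(is.na(temps))   # TRUE counts as 1, so this is the number of NAs
mean(is.na(temps))  # the same idea gives the proportion missing
```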

Choose one variable in the dataset and find the number of missing values it has

sum(is.na(water_quality$water_temperature_c))

[1] 75039

You can use the summarise function from the dplyr package to count the number of missing values in each variable in the dataset. The across function allows you to apply a function to multiple columns at once.

You will also need to create an anonymous function with one argument (the variable name) that returns the sum of missing values in that variable. The formal way to do this is \(x) sum(is.na(x)).

water_quality |> 
  summarise(across(everything(), \(x) sum(is.na(x))))

  region council swim_site date  time enterococci_cfu_100ml water_temperature_c
1      0       0         0    0 11192                   307               75039
  conductivity_ms_cm latitude longitude
1              78536        0         0

Read the documentation for summarise and across to understand how they work. You can access the documentation by running ?summarise and ?across in your R console.
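
The following sketch may help when reading that documentation. It uses a made-up tibble (toy_df and its columns are hypothetical) to show that the backslash shorthand and the long function(x) form behave the same way inside across, and that the column selection can be narrowed, here with where(is.numeric).

```r
library(dplyr)

# toy_df is a made-up tibble used only for illustration
toy_df <- tibble(
  site = c("A", "B", "C", "D"),
  temp = c(20, NA, 21, NA),
  cond = c(NA, 45250, 48930, 52700)
)

# Backslash shorthand for an anonymous function (requires R >= 4.1)
na_short <- toy_df |>
  summarise(across(everything(), \(x) sum(is.na(x))))

# Equivalent long form using the function keyword
na_long <- toy_df |>
  summarise(across(everything(), function(x) sum(is.na(x))))

# Restrict the count to numeric columns only
na_numeric <- toy_df |>
  summarise(across(where(is.numeric), \(x) sum(is.na(x))))
```

Each result is a one-row tibble with one column per selected variable, which is the same shape as the output shown above for the full dataset.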

2b. Create a new variable called log_enterococci that is the natural logarithm of the enterococci_cfu_100ml variable. (5)

water_quality <- water_quality |> 
  mutate(log_enterococci = log(enterococci_cfu_100ml + 1)) # to avoid the log(0) situation
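
A brief sketch of why the + 1 shift is used, on a made-up vector of counts (counts is hypothetical). It also shows base R's log1p, which computes log(1 + x) and could be used in place of log(x + 1). Note that the shifted variable is the log of the count plus one, not the log of the count itself.

```r
# counts is a made-up vector of bacteria counts, including a zero
counts <- c(0, 1, 64, 620)

log(counts)      # log(0) is -Inf, which the shift avoids
log(counts + 1)  # the shifted version used in this assignment
log1p(counts)    # base R helper that computes log(1 + x)

# The two shifted versions agree
all.equal(log(counts + 1), log1p(counts))
```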

2c. Create a new variable called month that is the month of the year extracted from the date variable. We want a textual representation of the month (e.g., “January”, “February”, etc.) (5)

See the documentation for lubridate::month for how to do this. You will need to change one of the default options in the lubridate::month function. The lubridate package is part of the tidyverse collection of packages, so you don’t need to load it separately.

water_quality <- water_quality |> 
  mutate(month = lubridate::month(date, label = TRUE, abbr = FALSE))
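
To illustrate what the label and abbr arguments change, here is a minimal sketch on a single made-up date (d is hypothetical). The text of the labels depends on the session locale, so no printed output is shown.

```r
library(lubridate)

# d is a made-up date used only for illustration
d <- as.Date("2025-04-28")

m_num  <- month(d)                              # default: month number
m_abbr <- month(d, label = TRUE)                # abbreviated month label
m_full <- month(d, label = TRUE, abbr = FALSE)  # full month name

is.ordered(m_full)  # labelled months are returned as an ordered factor
nlevels(m_full)     # all 12 months are levels, even if only one appears
```

Because the result is an ordered factor, grouped summaries and plots list the months in calendar order rather than alphabetically.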

2d. The maximum value of the recorded water temperature is 1040, which seems unreasonably high. The question is, are measurements at particular beaches being done poorly? Extract observations where the water temperature is above 35 degrees Celsius, and then create a frequency table of the beaches in this filtered dataset. (5)

You can use the filter function from the dplyr package to filter the dataset, and then use the count function to create a frequency table of the swim_site variable in the filtered dataset.

water_quality |> 
  filter(water_temperature_c > 35) |> 
  count(swim_site)

                          swim_site n
1                  Bilarong Reserve 1
2                       Bondi Beach 2
3                      Bronte Beach 1
4                         Camp Cove 1
5                    Clovelly Beach 1
6                      Coogee Beach 2
7                   Darling Harbour 1
8                     Edwards Beach 1
9                Gordons Bay (East) 1
10                 Little Bay Beach 1
11                    Malabar Beach 1
12                   Maroubra Beach 1
13 Narrabeen Lagoon (Birdwood Park) 1
14                    Newport Beach 1
15             North Cronulla Beach 1
16            North Narrabeen Beach 1
17             South Maroubra Beach 1
18          South Maroubra Rockpool 1
19                   Tamarama Beach 1
20                      Wanda Beach 1
21                 Warriewood Beach 1
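
When the frequency table is long, sorting it by count can make repeat offenders easier to spot. The sketch below uses a made-up tibble (toy_df and its values are hypothetical) and the sort argument of count.

```r
library(dplyr)

# toy_df is a made-up tibble used only for illustration
toy_df <- tibble(
  swim_site = c("Bondi Beach", "Coogee Beach", "Bondi Beach", "Camp Cove"),
  water_temperature_c = c(40, 36, 1040, 21)
)

high_counts <- toy_df |>
  filter(water_temperature_c > 35) |>  # keep only the suspicious readings
  count(swim_site, sort = TRUE)        # largest counts first

high_counts
```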

2e. Were the high temperature readings more common in particular years? (5)

water_quality <- water_quality |> 
  mutate(year = lubridate::year(date))

water_quality |>
  filter(water_temperature_c > 35) |>
  count(year)

Data summaries

3a. How many unique swim sites are in each region? (5)

Overall, there are 79 unique swim sites in the dataset. Use the group_by and summarise functions from the dplyr package to find the number of unique swim sites in each region. We will use a pipe (|>) to pass the water_quality dataset to the group_by function, and then pass the result to the summarise function.

water_quality |> 
  group_by(region) |> 
  summarise(num_swim_sites = n_distinct(swim_site))

# A tibble: 5 × 2
  region          num_swim_sites
  <chr>                    <int>
1 Northern Sydney             22
2 Southern Sydney              8
3 Sydney City                 11
4 Sydney Harbour              31
5 Western Sydney               7
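
As a cross-check, the same counts can be reached a second way: reduce the data to unique region and swim site pairs, then count the rows per region. This is a sketch on a made-up tibble (toy_df is hypothetical), not the required approach.

```r
library(dplyr)

# toy_df is a made-up tibble used only for illustration
toy_df <- tibble(
  region    = c("North", "North", "North", "South", "South"),
  swim_site = c("A", "A", "B", "C", "C")
)

# Approach used in this assignment
by_summarise <- toy_df |>
  group_by(region) |>
  summarise(num_swim_sites = n_distinct(swim_site))

# Alternative: unique pairs first, then count rows per region
by_count <- toy_df |>
  distinct(region, swim_site) |>
  count(region, name = "num_swim_sites")
```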

3b. What is the average water temperature in each region per month, after filtering out readings > 35C? (5)

water_quality |> 
  filter(water_temperature_c <= 35) |> # keep readings at or below 35C (also drops NA)
  group_by(region, month) |> 
  summarise(avg_water_temp = mean(water_temperature_c, na.rm = TRUE)) |> 
  ungroup() # this is to remove the grouping and make it usable for other operations

# A tibble: 60 × 3
   region          month     avg_water_temp
   <chr>           <ord>              <dbl>
 1 Northern Sydney January             21.7
 2 Northern Sydney February            22.3
 3 Northern Sydney March               22.0
 4 Northern Sydney April               20.8
 5 Northern Sydney May                 18.9
 6 Northern Sydney June                17.2
 7 Northern Sydney July                16.2
 8 Northern Sydney August              16.3
 9 Northern Sydney September           17.3
10 Northern Sydney October             18.5
# ℹ 50 more rows
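
The trailing ungroup() can also be folded into summarise through its .groups argument. A minimal sketch on a made-up tibble (toy_df is hypothetical) comparing the two forms:

```r
library(dplyr)

# toy_df is a made-up tibble used only for illustration
toy_df <- tibble(
  region = c("North", "North", "South", "South"),
  month  = c("January", "January", "January", "February"),
  temp   = c(20, 22, 19, 21)
)

# Form used in this assignment: summarise, then ungroup
with_ungroup <- toy_df |>
  group_by(region, month) |>
  summarise(avg_temp = mean(temp, na.rm = TRUE)) |>
  ungroup()

# Alternative: drop all grouping inside summarise
with_drop <- toy_df |>
  group_by(region, month) |>
  summarise(avg_temp = mean(temp, na.rm = TRUE), .groups = "drop")

is_grouped_df(with_drop)  # FALSE: no grouping remains
```

Both forms return an ungrouped tibble with the same values; the second also avoids the message summarise prints about the remaining grouping.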

Extra credit (10)

Using the water quality data filtered for water temperatures ≤ 35C, recreate the following plot. Even if you can’t code them in R yet, write out what data steps, and in what order, you would need to get the data in shape to be able to draw these plots.

This exercise is meant to train you to (a) reverse-engineer graphs, and (b) conceptualize the data munging steps you need to get to these visualizations.

First, I filtered out water temperatures greater than 35C, which also removes observations with NA, and stored the result in a new filtered data frame. From there, I used the group_by and summarise functions to get the mean water temperature for every month and year combination in the dataset. This allowed me to use ggplot and facet_wrap to create a graphic showing the mean temperature across years, with one panel per month. I made my lines red; otherwise the plot would look identical to yours.

# Filter out readings above 35C and store the result in a new data frame
water_quality_filtered <- water_quality |>
  filter(water_temperature_c <= 35)

# Mean temperature for each year and month, which makes for less crowded graphs
water_quality_summary <- water_quality_filtered |>
  group_by(year, month) |>
  summarise(mean_temp = mean(water_temperature_c, na.rm = TRUE))

# One panel per month, showing mean temperature across years
ggplot(water_quality_summary, aes(x = year, y = mean_temp)) +
  geom_line(color = "red") +
  facet_wrap(~month, nrow = 3) +
  ylab("Average monthly water temperature (C)") +
  xlab("")