Ecotourism Tutorial

When do glowworms glow? Joining wildlife, weather and tourism data

Author

Vajinder Kaur

🌿 The question

Imagine you work for a small ecotourism company in Queensland.

A client asks: “When is the best time to see glowworms?”

Glowworms lighting up a cave in Australia

To answer that, you need to answer three smaller questions first.

Where are glowworms actually spotted? What is the weather like there? And how busy is the region with other tourists?

This tutorial helps you find those answers.

🎯 Objectives

By the end of this tutorial you will be able to:

  • Filter wildlife occurrence records by region and month
  • Join occurrence data to daily weather using ws_id and date
  • Join occurrence data to quarterly tourism counts
  • Spot and explain missing data in a real dataset
  • Draw careful conclusions from incomplete information

Bonus: You will also learn to go online and check whether what the data shows matches what is actually true about glowworms in the wild.

🔧 Preparation

Install the ecotourism package if you have not already:

# install.packages("pak")
pak::pak("vahdatjavad/ecotourism")

Then load everything you need:

library(ecotourism)
library(tidyverse)
library(lubridate)
library(plotly)
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling()

We will work with four datasets from the package:

Dataset            Description
glowworms          Glowworm sighting records across Australia (2014-2024)
weather            Daily weather observations by station
tourism_quarterly  Quarterly overnight trip counts by region and purpose (trips in thousands)
tourism_region     Tourism region names and locations

All four datasets share one variable: ws_id. This is the weather station ID that links a sighting to its nearest weather station and tourism region. You will use it in every join.
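Here is a minimal sketch of what a join on ws_id looks like. The two toy tables stand in for the real datasets. The station IDs, dates and temperatures are invented for illustration; only the column names ws_id, date and max follow the package.

```r
library(dplyr)

# Toy stand-ins for the package data (all values are invented)
toy_sightings <- tibble(
  ws_id = c("A001", "A001", "B002"),
  date  = as.Date(c("2020-11-03", "2020-11-04", "2020-11-03"))
)

toy_weather <- tibble(
  ws_id = c("A001", "A001"),
  date  = as.Date(c("2020-11-03", "2020-11-04")),
  max   = c(29.5, 31.0)
)

# Keep every sighting; max is NA where no station-day matches
toy_joined <- toy_sightings |>
  left_join(toy_weather, by = c("ws_id", "date"))

toy_joined
```

The sighting at the made-up station B002 stays in the result, but its max is NA. That pattern is what you will look for in the exercises.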

🔍 Meet the data

Before we answer any questions, let us look at what we are working with.

The organisms

The ecotourism package has four wildlife datasets. Why did we choose glowworms?

# Compare all organisms -- size, weather coverage, tourism coverage
list(
  `Gouldian Finch` = gouldian_finch,
  `Manta Rays`     = manta_rays,
  `Glowworms`      = glowworms,
  `Orchids`        = orchids
) |>
  purrr::map_dfr(function(df) {
    # Aggregate tourism then join both datasets
    tq <- tourism_quarterly |>
      group_by(ws_id, year, quarter) |>
      summarise(total_trips = sum(trips, na.rm = TRUE), 
                .groups = "drop")
    df_full <- df |>
      mutate(quarter = quarter(date)) |>
      left_join(tq, by = c("ws_id", "year", "quarter")) |>
      left_join(weather |> select(ws_id, date, max),
                by = c("ws_id", "date"))
    tibble(
      Sightings          = nrow(df),
      `Weather coverage` = paste0(round(mean(!is.na(df_full$max))         * 100, 1), "%"),
      `Tourism coverage` = paste0(round(mean(!is.na(df_full$total_trips)) * 100, 1), "%"),
      States             = n_distinct(df$obs_state)
    )
  }, .id = "Organism") |>
  kable() |>
  kable_styling(full_width = FALSE)

Organism        Sightings  Weather coverage  Tourism coverage  States
Gouldian Finch       3922  54%               50.6%                  7
Manta Rays            953  75.1%             4.9%                   5
Glowworms             124  8.9%              39.5%                  3
Orchids             35052  7.5%              36.2%                  1

Orchids and Gouldian Finches have too many records for a focused tutorial. Manta rays have great weather data but almost no tourism matches. That makes a three-way join nearly impossible.

Glowworms sit in a sweet spot. Small enough to explore comfortably. Good enough tourism coverage for a meaningful join. They also have very little weather data: only 8.9% of sightings match a weather record, almost as low as orchids (7.5%). That turns out to be useful for teaching.

There is one more reason. Glowworms are bioluminescent. They glow after dark to attract prey. That means the hour of a sighting carries real biological meaning. A sighting at 2am tells a very different story than one at 2pm. No other organism in this package gives you that.

The sightings

Let us look at the key variables in the glowworm dataset:

tibble(
  Variable    = c("obs_lat", "obs_lon", "date", 
                  "hour", "obs_state", "ws_id"),
  Description = c(
    "Latitude of sighting",
    "Longitude of sighting",
    "Date of sighting",
    "Hour of day (0-23)",
    "Australian state",
    "Nearest weather station ID"
  )
) |>
  kable() |>
  kable_styling(full_width = FALSE)

Variable   Description
obs_lat    Latitude of sighting
obs_lon    Longitude of sighting
date       Date of sighting
hour       Hour of day (0-23)
obs_state  Australian state
ws_id      Nearest weather station ID

Three things to notice. First, ws_id links each sighting to its nearest weather station. Second, hour tells us what time of day the sighting was recorded. Third, obs_state tells us which Australian state it was in.

Why Queensland and November?

We compared all three states before choosing:

glowworms |>
  count(obs_state, sort = TRUE) |>
  kable() |>
  kable_styling(full_width = FALSE)

obs_state         n
Queensland       61
Tasmania         40
New South Wales  23

Queensland has the most sightings. But here is what made us choose November specifically:

glowworms |>
  filter(obs_state == "Queensland") |>
  mutate(
    time_of_day = case_when(
      hour >= 20 | hour <= 4 ~ "Night (8pm-4am)",
      hour >= 17              ~ "Evening (5pm-8pm)",
      TRUE                    ~ "Daytime"
    )
  ) |>
  count(month, time_of_day) |>
  complete(month = 1:12, time_of_day, fill = list(n = 0)) |>
  mutate(month_name = month.abb[month]) |>
  plot_ly(
    x      = ~month_name,
    y      = ~n,
    color  = ~time_of_day,
    type   = "scatter",
    mode   = "lines+markers"
  ) |>
  layout(
    xaxis  = list(title = "Month", categoryorder = "array",
                  categoryarray = month.abb),
    yaxis  = list(title = "Number of sightings"),
    legend = list(title = list(text = "Time of day")),
    margin = list(t = 20)
  )

 

Night sightings peak in November. The night line (blue in the chart) rises sharply while daytime sightings stay flat. That fits glowworm biology: they glow after dark, so night is when people spot them.
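It helps to see how the time-of-day rules in the plot code sort individual hours. case_when() checks its rules from top to bottom and uses the first one that is TRUE. A missing hour matches none of the comparisons, so in the code above it would fall through to "Daytime". The sketch below adds an explicit rule for that case. The hours are made up, and whether the real data contains missing hours is an assumption you should check.

```r
library(dplyr)

# Invented hours, including one missing value
toy_hours <- tibble(hour = c(2, 14, 18, 21, NA))

toy_labelled <- toy_hours |>
  mutate(
    time_of_day = case_when(
      is.na(hour)            ~ NA_character_,      # keep unknown hours unknown
      hour >= 20 | hour <= 4 ~ "Night (8pm-4am)",
      hour >= 17             ~ "Evening (5pm-8pm)",
      TRUE                   ~ "Daytime"
    )
  )

toy_labelled
```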

Your bonus task: Go online and check. Are glowworms in Queensland really most active in November? What does the Atlas of Living Australia say?

We also noticed Tasmania has surprisingly good data coverage in November. We focus on Queensland here. But Tasmania is waiting for you in the extensions at the end.

📥 Exercises

Exercise 1 — Where is the weather?

You are planning glowworm tours in Queensland in November.

Your first instinct is to check the weather. Warm humid nights are best for glowworm activity. So you join the sighting records to the weather data.

But something unexpected happens.

Your tasks:

  1. Filter glowworms to Queensland sightings in November. Save the result as glow_qld_nov.
  2. Join to weather using ws_id and date.
  3. Count how many sightings have a non-missing max temperature.
  4. What proportion of sightings have no weather data at all?
  5. Why do you think weather data is missing for these locations?
Tip

Before you start: Look at the ws_id values in your filtered data. Then look at the ws_id values in weather. Do they overlap?
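One way to check overlap, sketched on toy data rather than the real tables: intersect() shows which station IDs appear in both, and anti_join() shows which sightings have no weather row for the same station and day. All IDs, dates and values below are invented.

```r
library(dplyr)

# Toy stand-ins (all values are invented)
toy_sightings <- tibble(
  ws_id = c("A001", "B002", "B002", "C003"),
  date  = as.Date(c("2020-11-03", "2020-11-03", "2020-11-10", "2021-11-05"))
)

toy_weather <- tibble(
  ws_id = c("A001", "A001", "C003"),
  date  = as.Date(c("2020-11-03", "2020-11-04", "2019-11-05")),
  max   = c(29.5, 31.0, 27.2)
)

# Station IDs that appear in both tables
shared_ids <- intersect(toy_sightings$ws_id, toy_weather$ws_id)

# Sightings with no weather row for the same station AND the same day
unmatched <- toy_sightings |>
  anti_join(toy_weather, by = c("ws_id", "date"))

# Proportion of sightings without weather data
prop_missing <- nrow(unmatched) / nrow(toy_sightings)
```

Notice the made-up station C003. Its ID appears in both tables, yet the sighting still has no weather because the dates do not line up. A shared ID is not enough.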

Exercise 2 — Who is visiting?

Weather data is missing. But tourism data might still help.

Before you join, there is one thing to sort out. The tourism_quarterly dataset has two rows per region per quarter. One for Holiday trips and one for Business trips. If you join directly to glowworm sightings you will get duplicate rows. One sighting will appear twice.

Try it and see:

glow_qld_nov |>
  left_join(tourism_quarterly, by = c("ws_id", "year")) |>
  nrow()
[1] 76

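More rows than sightings. That is the many-to-many problem.

Think of it this way. You have 12 glowworm sightings. Each has a station ID. The tourism table has two rows per station per quarter. One for Holiday, one for Business. If you match on station and quarter, each sighting with tourism data matches both rows. In the simplest case, 12 sightings become 24 rows. The join above is even looser: it matches on ws_id and year only, so a sighting can also pick up rows from every quarter of that year. Your data multiplied without adding anything real.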

The fix is simple. Before joining, collapse those two rows into one. Add Holiday and Business trips together to get one total_trips number per station per quarter. Now each sighting matches exactly one tourism row.
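The sketch below shows the problem and the fix on toy data. The column name purpose and all trip counts are assumptions made for illustration, so check the real column names with glimpse(tourism_quarterly). The relationship argument needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy stand-ins (all values are invented)
toy_sightings <- tibble(
  ws_id   = c("A001", "A001", "B002"),
  year    = c(2020, 2020, 2021),
  quarter = c(4, 4, 4)
)

toy_tourism <- tibble(
  ws_id   = rep(c("A001", "B002"), each = 2),
  year    = rep(c(2020, 2021), each = 2),
  quarter = 4,
  purpose = rep(c("Holiday", "Business"), times = 2),  # assumed column name
  trips   = c(120, 30, 80, 20)
)

# Direct join: every sighting matches both purpose rows
rows_direct <- toy_sightings |>
  left_join(toy_tourism, by = c("ws_id", "year", "quarter"),
            relationship = "many-to-many") |>
  nrow()

# Collapse to one row per station and quarter, then join
toy_total <- toy_tourism |>
  group_by(ws_id, year, quarter) |>
  summarise(total_trips = sum(trips, na.rm = TRUE), .groups = "drop")

rows_fixed <- toy_sightings |>
  left_join(toy_total, by = c("ws_id", "year", "quarter")) |>
  nrow()

c(direct = rows_direct, fixed = rows_fixed)
```

Three toy sightings become six rows in the direct join and stay at three after aggregating.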

Your tasks:

  1. Aggregate tourism_quarterly by summing Holiday and Business trips into total_trips per ws_id, year and quarter. Save the result as tourism_total.
  2. Add a quarter column to glow_qld_nov. Save the result as glow_qld_nov_q.
  3. Join your aggregated tourism data to glow_qld_nov.
  4. How many sightings have a matching tourism record?
  5. Are glowworm sighting locations Holiday destinations or Business destinations?
Tip

Hint: November is in quarter 4. Use lubridate::quarter(date) to calculate it automatically.
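A quick check of quarter() on a few made-up dates:

```r
library(lubridate)

# Invented dates: one in February, two in November
toy_dates <- as.Date(c("2021-02-14", "2021-11-05", "2022-11-30"))

toy_quarters <- quarter(toy_dates)
toy_quarters
# February falls in quarter 1, November in quarter 4
```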

Exercise 3 — Putting it all together

You now know two things about Queensland glowworm sightings in November.

Weather data is completely missing. Tourism data exists for about 42% of sightings.

Now bring everything together into one joined dataset. Then answer the question your client actually asked: when is the best time to see glowworms?

Your tasks:

  1. Start from glow_qld_nov_q (your Queensland November sightings with quarter added).
  2. Join tourism data using tourism_total.
  3. Join weather data using ws_id and date.
  4. How complete is your final joined dataset across all three sources?
  5. Using sightings by year, which years have tourism data and which do not?
  6. Look at the hour of sightings. What time of day are most glowworms spotted in November?
  7. Based on everything you have found, what would you actually tell your client?
Tip

Think before you code: You already created tourism_total in Exercise 2. You do not need to recreate it. Just use it directly.
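For the completeness questions, one possible approach is to summarise coverage per year once everything is joined. The sketch below uses a toy table in place of your joined data. The years and values are invented; only the column names year, total_trips and max follow the tutorial.

```r
library(dplyr)

# Toy stand-in for a fully joined dataset (all values are invented)
toy_joined <- tibble(
  year        = c(2019, 2019, 2021, 2023, 2023),
  total_trips = c(150, 150, 95, NA, NA),
  max         = c(NA, 28.4, NA, NA, NA)
)

# Share of sightings per year with tourism and weather data
coverage_by_year <- toy_joined |>
  group_by(year) |>
  summarise(
    sightings   = n(),
    tourism_cov = mean(!is.na(total_trips)),
    weather_cov = mean(!is.na(max)),
    .groups     = "drop"
  )

coverage_by_year
```

A year with tourism_cov of 0 has sightings but no matching tourism rows. In your real data, ask whether that year falls outside the period the tourism table covers.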

👌 Finishing up

You set out to answer one question: when is the best time to see glowworms in Queensland?

You ended up learning three things that go beyond that question.

First, a station appearing in a dataset does not mean it has useful data. Always check date ranges and coverage before trusting a join.

Second, the many-to-many problem is not an error. It is a signal that your data needs preparation before joining.

Third, missing data tells a story too. The places where glowworms live in Queensland are remote, off the grid, and far from weather stations. That is part of what makes them special.

Try these extensions:

  • Re-run this entire analysis for Tasmania. Tasmania had 50% weather coverage and 83% tourism coverage in November. Does the story change? Is the client better served by a Tasmania glowworm tour?
  • Try inner_join() instead of left_join() in Exercise 1. How many sightings do you keep? What are you losing?
  • The tourism data only goes to 2022. Sightings go to 2024. What would you need to complete the picture?
  • Go to the Atlas of Living Australia at ala.org.au. Search for glowworms in Queensland. Do the locations match what you found here?
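For the inner_join() extension, this toy sketch shows the difference in row counts. The station IDs, dates and temperature are invented.

```r
library(dplyr)

# Toy stand-ins (all values are invented)
toy_sightings <- tibble(
  ws_id = c("A001", "B002", "C003"),
  date  = as.Date(c("2020-11-03", "2020-11-03", "2020-11-04"))
)

toy_weather <- tibble(
  ws_id = "A001",
  date  = as.Date("2020-11-03"),
  max   = 29.5
)

# left_join keeps all sightings; inner_join keeps only matched ones
n_left <- toy_sightings |>
  left_join(toy_weather, by = c("ws_id", "date")) |>
  nrow()

n_inner <- toy_sightings |>
  inner_join(toy_weather, by = c("ws_id", "date")) |>
  nrow()

c(left = n_left, inner = n_inner)
```

The inner join silently drops the two unmatched sightings. With a left join you can still see them, and count them.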