Exploring Data Through Visualizations: Independent Investigations

Load the packages and data we’ll need for the project

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
forest_fires <- read_csv("forestfires.csv")
## Rows: 517 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): month, day
## dbl (11): X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The Importance of Forest Fire Data

# What columns are in the dataset?
colnames(forest_fires)
##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"

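We know that the columns correspond to the following information:

X: x-axis spatial coordinate within the Montesinho park map
Y: y-axis spatial coordinate within the Montesinho park map
month: month of the year ("jan" to "dec")
day: day of the week ("mon" to "sun")
FFMC: Fine Fuel Moisture Code index from the FWI system
DMC: Duff Moisture Code index from the FWI system
DC: Drought Code index from the FWI system
ISI: Initial Spread Index from the FWI system
temp: temperature in degrees Celsius
RH: relative humidity in percent
wind: wind speed in km/h
rain: outside rain in mm/m2
area: burned area of the forest in hectares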

Each row corresponds to a single fire: its location and some characteristics of the conditions under which it occurred. More moisture is typically associated with less fire spread, so we might expect the moisture-related variables (such as DMC and rain) to be related to area.

Data Processing

month and day are character variables, but we know that there is an inherent order to them. We’ll convert these variables into factors so that they’ll be sorted into the correct order when we plot them.

forest_fires %>% pull(month) %>% unique()
##  [1] "mar" "oct" "aug" "sep" "apr" "jun" "jul" "feb" "jan" "dec" "may" "nov"
forest_fires %>% pull(day) %>% unique()
## [1] "fri" "tue" "sat" "sun" "mon" "wed" "thu"

This guided project assumes that Sunday is the first day of the week, but feel free to adjust the levels to whatever is comfortable for you. Ultimately, the levels only control the order in which categories appear in the resulting plots.

month_order <- c("jan", "feb", "mar",
                 "apr", "may", "jun",
                 "jul", "aug", "sep",
                 "oct", "nov", "dec")

dow_order <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")

forest_fires <- forest_fires %>%
  mutate(
    month = factor(month, levels = month_order),
    day = factor(day, levels = dow_order)
  )
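
To see why the explicit levels matter, the following minimal sketch compares how a character vector and a factor sort. The toy vector and the variable names are illustrative assumptions, not part of the project data.

```r
# Toy vector of month abbreviations, in the same style as the `month` column.
months_chr <- c("mar", "oct", "aug", "jan")

# Character vectors sort alphabetically, which scrambles calendar order.
sorted_chr <- sort(months_chr)

# With explicit levels, sorting (and plot axes) follow the order we declare.
month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec")
months_fct <- factor(months_chr, levels = month_levels)
sorted_fct <- as.character(sort(months_fct))

print(sorted_chr)  # alphabetical order
print(sorted_fct)  # calendar order

# Values that are missing from `levels` become NA, so a quick check is useful.
stopifnot(!anyNA(months_fct))
```

The same NA check can be run on the real columns after the mutate() call to catch a misspelled level.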

When Do Most Forest Fires Occur?

We need to create a summary tibble that counts the number of fires that appear in each month. Then, we’ll be able to use this tibble in a visualization. We can treat month and day as two different grouping variables, so the code that produces the tibbles and plots will look similar for both.

Month Level

fires_by_month <- forest_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())

fires_by_month
## # A tibble: 12 × 2
##    month total_fires
##    <fct>       <int>
##  1 jan             2
##  2 feb            20
##  3 mar            54
##  4 apr             9
##  5 may             2
##  6 jun            17
##  7 jul            32
##  8 aug           184
##  9 sep           172
## 10 oct            15
## 11 nov             1
## 12 dec             9
fires_by_month %>% 
  ggplot(aes(x = month, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by month",
    y = "Fire count",
    x = "Month"
  )

Day of the Week Level

fires_by_dow <- forest_fires %>%
  group_by(day) %>%
  summarize(total_fires = n())

fires_by_dow
## # A tibble: 7 × 2
##   day   total_fires
##   <fct>       <int>
## 1 sun            95
## 2 mon            74
## 3 tue            64
## 4 wed            54
## 5 thu            61
## 6 fri            85
## 7 sat            84
fires_by_dow %>% 
  ggplot(aes(x = day, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by day of the week",
    y = "Fire count",
    x = "Day of the week"
  )
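
Because the month and day summaries differ only in the grouping variable, the counting step could be wrapped in a small helper. This is a sketch under assumptions: count_fires() is a hypothetical function that does not appear in the original project, and toy_fires is made-up data standing in for forest_fires. It relies on dplyr's count(), which should give the same result as group_by() followed by summarize(n()).

```r
library(dplyr)

# Hypothetical helper (not part of the original project): count rows per group.
# The {{ }} operator forwards an unquoted column name to count().
count_fires <- function(data, group_var) {
  data %>%
    count({{ group_var }}, name = "total_fires")
}

# Toy data standing in for forest_fires; the values are made up.
toy_fires <- tibble(
  month = factor(c("aug", "aug", "sep", "mar"), levels = c("mar", "aug", "sep")),
  day   = factor(c("sun", "fri", "sun", "sun"), levels = c("sun", "fri"))
)

toy_by_month <- count_fires(toy_fires, month)
toy_by_day   <- count_fires(toy_fires, day)

print(toy_by_month)
print(toy_by_day)
```

With the real data, count_fires(forest_fires, month) and count_fires(forest_fires, day) would be expected to reproduce the two summary tibbles.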

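We see a massive spike in fires in August (184) and September (172), as well as a smaller spike in March (54). Fires also appear to be more frequent from Friday through Sunday (85, 84, and 95 fires) than from Tuesday through Thursday (64, 54, and 61 fires).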
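
One way to put a number on the day-of-week pattern is to compare the average count per weekend day with the average per weekday. The sketch below copies the counts from the fires_by_dow output and assumes that "weekend" means Saturday and Sunday only; a different definition would change the result.

```r
# Counts copied from the fires_by_dow output above.
fires_by_day <- c(sun = 95, mon = 74, tue = 64, wed = 54,
                  thu = 61, fri = 85, sat = 84)

# Assumption: "weekend" means Saturday and Sunday only.
weekend_days <- c("sat", "sun")
is_weekend <- names(fires_by_day) %in% weekend_days

mean_weekend <- mean(fires_by_day[is_weekend])   # average fires per weekend day
mean_weekday <- mean(fires_by_day[!is_weekend])  # average fires per weekday

print(c(weekend = mean_weekend, weekday = mean_weekday))
```

Under this definition the weekend average is 89.5 fires per day against 67.6 for weekdays, although Friday alone (85) is close to the weekend values.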

Plotting Other Variables Against Time

forest_fires_long <- forest_fires %>% 
  pivot_longer(
    cols = c("FFMC", "DMC", "DC", 
             "ISI", "temp", "RH", 
             "wind", "rain"),
    names_to = "data_col",
    values_to = "value"
  )

forest_fires_long %>% 
  ggplot(aes(x = month, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(data_col), scales = "free_y") +
  labs(
    title = "Variable changes over month",
    x = "Month",
    y = "Variable value"
  )
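
To make the reshaping step concrete, here is a minimal sketch of what pivot_longer() does to a small table. The tibble toy_wide and its values are made up for illustration; only the names_to and values_to arguments mirror the call above.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per fire, one column per measurement (made-up values).
toy_wide <- tibble(
  month = c("aug", "sep"),
  temp  = c(20, 15),
  wind  = c(4, 2)
)

# Stack the measurement columns: each (row, column) pair becomes its own row.
toy_long <- pivot_longer(
  toy_wide,
  cols = c("temp", "wind"),
  names_to = "data_col",
  values_to = "value"
)

print(toy_long)
```

Each original row turns into one row per measurement column, which is what lets facet_wrap() draw a separate panel for every value of data_col.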

Examining Forest Fire Severity

We want to see how each of the variables in the dataset relates to area. We can reuse the long-format version of the data we created earlier and pass it to facet_wrap() to draw one panel per variable.

forest_fires_long %>% 
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned",
    x = "Value of column",
    y = "Area burned (hectare)"
  )

Outlier Problems

It seems that two rows have area values so large that they stretch the y-axis and squash the remaining points toward zero. Let’s make a similar visualization that excludes these observations so that we can better see how each variable relates to area.

forest_fires_long %>% 
  filter(area < 300) %>% 
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned (area < 300)",
    x = "Value of column",
    y = "Area burned (hectare)"
  )
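
Filtering is one way to deal with the extreme rows. An alternative, which is not part of the original project, is to keep every row and compress the scale with a log transform. The sketch below contrasts the two on a toy tibble; the values are made up, and log1p() is chosen on the assumption that area may contain zeros, where a plain log() would return -Inf.

```r
library(dplyr)

# Toy areas (made-up values) with two extreme rows, mimicking a skewed `area`.
toy <- tibble(area = c(0, 0, 2.5, 10, 40, 750, 1000))

# Option 1 (used in this project): drop the extreme rows before plotting.
toy_filtered <- toy %>% filter(area < 300)

# Option 2 (an alternative, not from the original): keep every row but
# compress the scale. log1p(x) = log(1 + x), so zero-area rows map to 0.
toy_logged <- toy %>% mutate(log_area = log1p(area))

print(nrow(toy) - nrow(toy_filtered))  # rows removed by the filter
print(range(toy_logged$log_area))      # range after the transform
```

The filter gives a readable plot at the cost of hiding the largest fires, while the transformed column keeps them visible but makes the y-axis harder to read in hectares.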