Load the packages and data we’ll need for the project
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
forest_fires <- read_csv("forestfires.csv")
## Rows: 517 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): month, day
## dbl (11): X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What columns are in the dataset?
colnames(forest_fires)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI" "temp"
## [10] "RH" "wind" "rain" "area"
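We know that the columns correspond to the following information:
X: x-axis spatial coordinate within the Montesinho park map
Y: y-axis spatial coordinate within the Montesinho park map
month: month of the year ("jan" to "dec")
day: day of the week ("mon" to "sun")
FFMC: Fine Fuel Moisture Code index from the FWI system
DMC: Duff Moisture Code index from the FWI system
DC: Drought Code index from the FWI system
ISI: Initial Spread Index from the FWI system
temp: temperature in degrees Celsius
RH: relative humidity in percent
wind: wind speed in km/h
rain: outside rain in mm/m2
area: burned area of the forest in hectares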
A single row corresponds to the location of a fire and some
characteristics of the fire itself. Higher water presence is
typically associated with less fire spread, so we might expect the
water-related variables (DMC and rain) to be
related to area.
month and day are character variables, but
we know that there is an inherent order to them. We’ll convert these
variables into factors so that they’ll be sorted into the correct order
when we plot them.
forest_fires %>% pull(month) %>% unique()
## [1] "mar" "oct" "aug" "sep" "apr" "jun" "jul" "feb" "jan" "dec" "may" "nov"
forest_fires %>% pull(day) %>% unique()
## [1] "fri" "tue" "sat" "sun" "mon" "wed" "thu"
This guided project assumes that Sunday is the first day of the week, but feel free to adjust the levels to whatever is comfortable for you. Ultimately, the levels just put the resulting plots in an order that makes sense to us.
month_order <- c("jan", "feb", "mar",
                 "apr", "may", "jun",
                 "jul", "aug", "sep",
                 "oct", "nov", "dec")
dow_order <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
forest_fires <- forest_fires %>%
  mutate(
    month = factor(month, levels = month_order),
    day = factor(day, levels = dow_order)
  )
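One thing worth checking after a conversion like this: factor() turns any value that is not listed in levels into NA without raising an error. The following is a minimal sketch on a made-up vector (toy_days and its values are illustrative, not taken from forestfires.csv) that shows both the reordering and the NA behaviour.

```r
# Toy vector; "Sat" is deliberately miscapitalised so it matches no level
toy_days <- c("fri", "sun", "Sat", "mon")
dow_order <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")

toy_factor <- factor(toy_days, levels = dow_order)

# Sorting follows the level order (sun first), not the alphabet;
# sort() drops NA values by default
sorted_days <- as.character(sort(toy_factor))

# Values missing from `levels` become NA, so count them as a sanity check
n_unmatched <- sum(is.na(toy_factor))
```

On the real data, sum(is.na(forest_fires$month)) and sum(is.na(forest_fires$day)) should both be 0 if every label matched a level.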
We need to create a summary tibble that counts the number of fires
that appear in each month. Then, we’ll be able to use this tibble in a
visualization. We can treat month and day
as two separate grouping variables, so the code that produces the tibbles
and plots will look similar for both.
fires_by_month <- forest_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())
fires_by_month
## # A tibble: 12 × 2
## month total_fires
## <fct> <int>
## 1 jan 2
## 2 feb 20
## 3 mar 54
## 4 apr 9
## 5 may 2
## 6 jun 17
## 7 jul 32
## 8 aug 184
## 9 sep 172
## 10 oct 15
## 11 nov 1
## 12 dec 9
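The group_by() plus summarize(n()) pattern used here can also be written with dplyr's count(). The sketch below compares the two on a made-up tibble (toy_fires and its values are illustrative only); it is an alternative spelling, not a change to the analysis.

```r
library(dplyr)

# Made-up data: five fires spread over three months
toy_fires <- tibble(
  month = factor(c("aug", "mar", "aug", "sep", "aug"),
                 levels = c("mar", "aug", "sep"))
)

# Pattern used in this project
by_group <- toy_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())

# Equivalent one-step version; `name` sets the count column's name
by_count <- toy_fires %>%
  count(month, name = "total_fires")
```

Both tibbles have one row per month, in factor-level order, with the same counts.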
fires_by_month %>%
  ggplot(aes(x = month, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by month",
    y = "Fire count",
    x = "Month"
  )
fires_by_dow <- forest_fires %>%
  group_by(day) %>%
  summarize(total_fires = n())
fires_by_dow
## # A tibble: 7 × 2
## day total_fires
## <fct> <int>
## 1 sun 95
## 2 mon 74
## 3 tue 64
## 4 wed 54
## 5 thu 61
## 6 fri 85
## 7 sat 84
fires_by_dow %>%
  ggplot(aes(x = day, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by day of the week",
    y = "Fire count",
    x = "Day of the week"
  )
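We see a massive spike in fires in August (184) and September (172), as well as a smaller spike in March (54). Fires also seem to be more frequent around the weekend: Friday, Saturday and Sunday have the highest counts (85, 84 and 95), while Wednesday has the lowest (54).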
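To put a number on the seasonal spike, the monthly counts can be turned into shares of all fires. This is a small sketch that re-enters the counts printed in the fires_by_month output so it runs on its own; month_share and peak_share are names introduced here for illustration.

```r
library(dplyr)

month_order <- c("jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec")

# Counts copied from the fires_by_month tibble printed earlier
fires_by_month <- tibble(
  month = factor(month_order, levels = month_order),
  total_fires = c(2L, 20L, 54L, 9L, 2L, 17L, 32L, 184L, 172L, 15L, 1L, 9L)
)

# Share of all fires that falls in each month, largest first
month_share <- fires_by_month %>%
  mutate(share = total_fires / sum(total_fires)) %>%
  arrange(desc(share))

# Combined share of the two peak months (356 of 517 fires)
peak_share <- month_share %>%
  filter(month %in% c("aug", "sep")) %>%
  summarize(share = sum(share)) %>%
  pull(share)
```

August and September together account for 356 of the 517 fires, or roughly 69%.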
forest_fires_long <- forest_fires %>%
  pivot_longer(
    cols = c("FFMC", "DMC", "DC",
             "ISI", "temp", "RH",
             "wind", "rain"),
    names_to = "data_col",
    values_to = "value"
  )
forest_fires_long %>%
  ggplot(aes(x = month, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(data_col), scales = "free_y") +
  labs(
    title = "Variable changes by month",
    x = "Month",
    y = "Variable value"
  )
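The reshaping step that feeds this plot is easier to see on a tiny example. The sketch below applies the same pivot_longer() arguments to a made-up two-row tibble (toy_wide and its values are illustrative only): every measurement column becomes a pair of data_col and value entries, so each original row turns into one row per measured variable.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per fire, one column per measurement
toy_wide <- tibble(
  month = c("aug", "sep"),
  temp  = c(20.5, 18.0),
  wind  = c(4.0, 2.2)
)

# Long format: one row per fire and measurement
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c("temp", "wind"),
    names_to = "data_col",
    values_to = "value"
  )
```

Here toy_long has four rows (two fires times two variables) and the columns month, data_col and value, which is the shape facet_wrap(vars(data_col)) needs to draw one panel per variable.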
We want to see how each variable in the dataset relates
to area. We can reuse the long-format version of the
data we created and facet it with facet_wrap().
forest_fires_long %>%
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned",
    x = "Value of column",
    y = "Area burned (hectares)"
  )
It seems that two rows have area values large enough to dominate
the scale of the visualization. Let’s make a similar visualization that
excludes these observations so that we can better see how each variable
relates to area.
forest_fires_long %>%
  filter(area < 300) %>%
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned (area < 300)",
    x = "Value of column",
    y = "Area burned (hectares)"
  )
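Before dropping rows with a cutoff like this, it can help to look at exactly which rows fall above it. The following is a minimal sketch on made-up values (toy_fires, fire_id and the areas are illustrative, not rows from forestfires.csv); area_cutoff mirrors the 300 used in the filter() call above.

```r
library(dplyr)

# Made-up burned areas, with one extreme value
toy_fires <- tibble(
  fire_id = 1:6,
  area = c(0, 3.2, 12.5, 0, 950, 45.1)
)

area_cutoff <- 300

# Rows the cutoff would exclude, largest first
dropped <- toy_fires %>%
  filter(area >= area_cutoff) %>%
  arrange(desc(area))

# Rows that remain in the plot
kept <- toy_fires %>%
  filter(area < area_cutoff)
```

Applying the same two filters to forest_fires (not forest_fires_long, where each fire appears once per variable) would list the excluded fires once each.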