#Load the data and activate the tidyverse package

forestfire <-read.csv("forestfires.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

#Inspect the data

dim(forestfire)
## [1] 517  13
colnames(forestfire)
##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"

The spatial dataset contains 517 observations of 13 variables.

#Explanation of column names

- X and Y: the spatial coordinates (location) of the fire
- month and day: the time indicators of forest fire incidents
- FFMC (Fine Fuel Moisture Code) and DMC (Duff Moisture Code): indicators of the moisture content of the surrounding environment. For both, a higher index value is associated with a greater forest fire risk. FFMC describes the moisture content of fine dead fuel, such as leaves, while DMC describes the moisture content of the decomposed organic material on the forest floor.
- DC (Drought Code): reflects the dryness of organic material in the deeper soil layers, which is associated with how deep the fire can burn. A higher DC level is associated with a forest fire that might be more challenging to control.
- ISI (Initial Spread Index): captures how readily a fire can spread. A higher value is related to a greater degree of fire risk.
- temp (temperature in degrees Celsius), RH (relative humidity in percent), wind (wind speed), rain (outside rain), and area (area burned)

Now, let’s inspect the month and day variables

forestfire%>%pull(month)%>%unique
##  [1] "mar" "oct" "aug" "sep" "apr" "jun" "jul" "feb" "jan" "dec" "may" "nov"
forestfire%>%pull(day)%>%unique
## [1] "fri" "tue" "sat" "sun" "mon" "wed" "thu"

The month and day variables are stored as character strings, so plots and summaries would arrange them alphabetically rather than in calendar order. We need to add numeric variables that indicate the order.

forestfire <- forestfire%>%mutate(
  month_order = case_when(
    month == "jan" ~ 1,
    month == "feb" ~ 2,
    month == "mar" ~ 3,
    month == "apr" ~ 4,
    month == "may" ~ 5,
    month == "jun" ~ 6,
    month == "jul" ~ 7,
    month == "aug" ~ 8,
    month == "sep" ~ 9,
    month == "oct" ~ 10,
    month == "nov" ~ 11,
    month == "dec" ~ 12
  )
)

forestfire <- forestfire%>%mutate(
  day_order = case_when(
    day == "sun" ~ 1,
    day == "mon" ~ 2,
    day == "tue" ~ 3,
    day == "wed" ~ 4,
    day == "thu" ~ 5,
    day == "fri" ~ 6,
    day == "sat" ~ 7
  )
)
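As an alternative to numeric helper columns, the same ordering could be expressed with factors whose levels are in calendar order, which ggplot2 treats as discrete and plots in level order. The sketch below uses a small made-up data frame (`toy`) rather than the real data, and the names `month_levels`, `day_levels`, `month_f`, and `day_f` are illustrative assumptions.

```r
# Toy stand-in for a few rows of forestfire (values are made up)
toy <- data.frame(
  month = c("mar", "oct", "aug", "jan"),
  day   = c("fri", "tue", "sun", "mon")
)

# Calendar order; month.abb is base R's c("Jan", "Feb", ...)
month_levels <- tolower(month.abb)
day_levels   <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")

# Factors remember the level order, so no separate *_order column is needed
toy$month_f <- factor(toy$month, levels = month_levels)
toy$day_f   <- factor(toy$day, levels = day_levels)

# as.integer() recovers the same codes as the case_when() columns
as.integer(toy$month_f)
as.integer(toy$day_f)
```

With factors, the axis shows the month and day names directly, at the cost of needing `as.integer()` whenever a numeric position is required.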

Now, we inspect which months have the most fires.

fires_by_month <- forestfire %>%
  group_by(month_order) %>%
  summarize(total_fires = n())

fires_by_month %>% 
  ggplot(aes(x = month_order, y = total_fires)) +
  geom_col() +
  # Label each bar with its fire count
  geom_text(aes(label = total_fires), vjust = -0.5) +
  # month_order is numeric, so a continuous scale supplies the month names
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(
    title = "Number of forest fires in data by month",
    y = "Fire count",
    x = "Month"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

August and September appear to be the months with the most forest fires.

Now, let’s inspect which days of the week have the most forest fires.

fires_by_day <- forestfire%>%group_by(day_order)%>%summarize(total_fires = n())

fires_by_day %>%
  ggplot(aes(x = day_order, y = total_fires)) +
  geom_col() +
  # Label each bar with its fire count
  geom_text(aes(label = total_fires), vjust = -0.5) +
  # day_order is numeric (1 = Sunday), so a continuous scale supplies the day names
  scale_x_continuous(
    breaks = 1:7,
    labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  ) +
  labs(
    title = "Forest Fires by Day of the Week",
    y = "Fire count",
    x = "Day of the Week"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

Compared with month, day of the week has much less effect on the number of forest fires. With Sunday coded as 1, the number of fires appears to peak at the start of the week, decline through midweek, and recover toward the weekend.

In the subsequent analyses, we will focus on months.

Now, we would like to inspect any relationships between the month variable and the environmental indexes

forest_fires_long <- forestfire %>% 
  pivot_longer(
    cols = c("FFMC", "DMC", "DC", 
             "ISI", "temp", "RH", 
             "wind", "rain"),
    names_to = "data_col",
    values_to = "value"
  )

forest_fires_long%>%
  ggplot(aes(x=month_order, y=value))+
  geom_col()+
  facet_wrap(vars(data_col), scales = "free_y")+
  labs(
    title = "Variable changes over month",
    x = "Month",
    y = "Variable value"
  )

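All variables except rain show their largest bars in August and September, while rain is low in August and high in September. Note that geom_col() stacks every observation within a month, so the bar heights are monthly totals and partly reflect how many fires were recorded in each month.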
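To separate the typical level of a variable from the number of fires in a month, one option is to compare monthly medians instead of monthly totals. The sketch below is only an illustration on made-up numbers: the `toy` data frame and its values are assumptions, and base R's `aggregate()` is used so that it runs without extra packages.

```r
# Toy data (made up): one indicator, two fires in March and four in August
toy <- data.frame(
  month_order = c(3, 3, 8, 8, 8, 8),
  temp        = c(10, 12, 20, 22, 24, 26)
)

# Monthly totals (what stacked bars show) grow with the number of rows
totals <- aggregate(temp ~ month_order, data = toy, FUN = sum)

# Monthly medians describe a typical fire and do not depend on the row count
medians <- aggregate(temp ~ month_order, data = toy, FUN = median)

totals
medians
```

On the real data, the same idea could be applied to each facet, for example by summarizing `value` per `month_order` and `data_col` before plotting, or by drawing one box plot per month.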

Next, we would like to investigate whether the environmental indexes are related to the severity of a fire, as reflected by the area variable.

forest_fires_long%>%
  ggplot(aes(x=value, y=area))+
  geom_point()+
  facet_wrap(vars(data_col), scales = "free_x")+
  labs(
    title = "Relationships between other variables and area burned",
    x = "Index",
    y = "Area of Fire"
  )

It seems there are two outliers in the area variable. Let’s explore this further.

# Use forestfire here: forest_fires_long repeats each fire once per pivoted variable
forestfire%>%
  ggplot(aes(x=area))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can clearly see two outliers, represented by the two short horizontal lines on the far right.

Now let’s see if we can use a box plot to further explore this

# One row per fire is enough for the distribution of area
forestfire%>%
  ggplot(aes(x=area))+
  geom_boxplot()

It looks like 300 is a good threshold for outliers.
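The cutoff of 300 is a visual choice. For comparison, the conventional box plot rule flags any point above Q3 + 1.5 × IQR, and on strongly right-skewed data that fence can sit far below a hand-picked cutoff. The sketch below illustrates both on made-up numbers; the `area` vector here is a toy assumption, not the real column.

```r
# Toy burned areas (made up): mostly tiny fires, two very large ones
area <- c(0, 0, 0, 0, 0.5, 1, 2, 3, 5, 40, 750, 1100)

# Fires above the visually chosen cutoff
n_above_cutoff <- sum(area > 300)

# Conventional box plot rule: upper fence = Q3 + 1.5 * IQR
q <- quantile(area, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
n_above_fence <- sum(area > upper_fence)

c(cutoff = n_above_cutoff, fence = n_above_fence)
```

Because the fence rule can mark many ordinary fires as outliers when most areas are near zero, a fixed cutoff that removes only the extreme cases is a reasonable choice for plotting.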

forest_fires_long %>% 
  filter(area < 300) %>% 
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned (area < 300)",
    x = "Indicators",
    y = "Area burned (hectare)"
  )

I see no obvious trends about how each indicator affects the area of a fire. However, there are some interesting observations.

DC: This variable seems to have a positive relationship with the area of fires: there is a cluster of points with low DC and small areas, and another group with high DC and large areas.

DMC: Fires of all sizes occur across the full range of DMC values. No obvious trends.

FFMC: There is a very obvious cluster of data, showing that the majority of fires are associated with a high FFMC index. I suspect that there is a threshold level of FFMC above which a fire is much more likely to take place.

ISI: As opposed to FFMC, I see a cluster of low-ISI fires. Maybe there is a threshold below which a fire is likely to happen.

Rain: Most fires are associated with low levels of precipitation, as expected.

RH: No obvious trend or cluster, but I do see some large-area fires with low RH values.

temp: No obvious trend or cluster, but there are some high-temperature, large-area fires.

wind: No obvious trend or cluster.
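These visual impressions could be checked numerically with a rank correlation between each indicator and area, which is less sensitive to the skew in area than a Pearson correlation. The sketch below shows the idea on made-up numbers only: the `toy` data frame, its values, and the `indicators` vector are assumptions, and it reports nothing about the real data.

```r
# Toy data (made up): DC tends to rise with area, wind does not
toy <- data.frame(
  area = c(1, 2, 3, 4, 5, 6),
  DC   = c(10, 20, 15, 40, 50, 45),
  wind = c(3, 1, 4, 1, 5, 2)
)

indicators <- c("DC", "wind")

# Spearman rank correlation of each indicator with area
rho <- sapply(indicators, function(v) {
  cor(toy[[v]], toy$area, method = "spearman")
})

round(rho, 2)
```

A value near 0 would be consistent with "no obvious trend", while a clearly positive or negative value would support a monotonic relationship such as the one suspected for DC.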