Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective
The original data visualization is intended to show wildfire events in California that burnt more than 300 acres between 2000 and 2017. It tries to give the audience a glimpse into a future under global warming, in which wildfires, once rare events, are becoming more common. The visualization is supposed to show the number of acres burnt and the duration of each wildfire event in each year. Each triangle is a fire event: the height of the triangle represents the acres burnt and the color intensity represents the duration. The visualization mainly targets environmental activists, concerned government departments, farmers, and people who live in rural areas. These audiences are the most concerned because fire events may affect their lives directly, so it is very important to communicate the data accurately.
The visualization chosen had the following three main issues:
The visualization fails to show the intensity of wildfires as accurately as the audience would want. Each wildfire event is plotted on its starting day as a triangle with the same width for every event. The triangle shape doesn't take the duration of the wildfire into account, which in some cases gives the impression that the fire was very intense. Some fire events last up to two months, so it doesn't make sense to plot them on a single day, where they appear as a spike in the graph with an intense color.
The visualization plots all events on top of each other, so smaller events are covered and become insignificant, and the plot fails to represent the number of acres burnt at a given time. In real life these small events may be geographically close to each other or to a larger event, which means they should contribute to the significance of the fire events at that time. Look at August 2008, for example: plenty of small events are stacked on top of each other, but the plot fails to show the huge number of acres burnt at that time.
The height of the triangle is supposed to show the number of acres burnt, but the scale legend is inconsistent and doesn't represent acres burnt accurately. 100K acres burnt is 10 times as large as 10K acres, yet the 10K triangle is almost half the height of the 100K triangle. Furthermore, because the mark is a triangle rather than a rectangle, the distortion is even worse: the area of a triangle is half the area of a rectangle with the same base and height, which misrepresents the size of the fires when comparing two events.
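As a rough check on the scale issue, the sketch below computes what the mark heights would be on a linear scale. The two fire sizes are the ones quoted from the legend; the 200-pixel maximum height is an assumed value chosen purely for illustration, not a measurement taken from the original graphic.

```r
# Assumed values for illustration only: two fire sizes from the legend
# and a hypothetical maximum mark height of 200 pixels.
acres <- c(small = 10000, large = 100000)
max_height_px <- 200

# On a linear scale, height is proportional to acres burnt
height_px <- acres / max(acres) * max_height_px
print(height_px)

# Ratio of the small mark to the large one on a linear scale:
# one tenth, rather than the roughly one half seen in the original legend
ratio <- height_px[["small"]] / height_px[["large"]]
print(ratio)
```

Under these assumptions the 10K mark would be one tenth as tall as the 100K mark, which is the proportion a reader would expect from the numbers.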
Reference
The following code was used to fix the issues identified in the original.
# add libraries
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggridges)
df <- read_csv("https://gist.githubusercontent.com/lazarogamio/d64e0d04b1ce1f2a3bd08db7526fa632/raw/3de009f5e5bdc9a86489f7ec9e181ca574a3021d/axios-calfire-wildfire-data.csv")
#get a copy of df
wildfires <- df
#make sure the date columns have the correct data type
wildfires$start <- as.Date(wildfires$start, "%m-%d-%Y")
wildfires$end <- as.Date(wildfires$end, "%m-%d-%Y")
#separate the start date into year, month and day
wildfires <- wildfires %>% separate(start, c("start_year", "start_month", "start_day"), sep = "-", remove = FALSE)
wildfires$start_year <- as.numeric(wildfires$start_year)
wildfires$start_month <- as.numeric(wildfires$start_month)
#add duration column
wildfires <- wildfires %>% mutate(duration = end - start + 1)
wildfires$duration <- as.numeric(wildfires$duration)
#sum up acres per year
acres_per_year <- wildfires %>% group_by(start_year) %>% summarise(sum(acres))
#sum up no of wildfires per year
fires_per_year <- wildfires %>% group_by(start_year) %>% count()
#create dataframe to store acres burnt each day
full_calender <- data.frame(date = seq(as.Date("2000-01-01"),as.Date("2017-12-31"),1), acres_burnt = 0)
for(fire in seq_len(nrow(wildfires))){
  #fetch variables required (as plain values rather than one-cell data frames)
  acres_burnt <- as.numeric(wildfires$acres[fire])
  duration <- as.numeric(wildfires$duration[fire])
  start <- wildfires$start[fire]
  #spread the acres evenly over every day the fire burnt
  acres_per_day <- acres_burnt/duration
  #store value in new dataframe
  new_cal <- data.frame(date = seq(as.Date("2000-01-01"),as.Date("2017-12-31"),1), acres_burnt = 0)
  new_cal <- new_cal %>% mutate(acres_burnt = replace(acres_burnt, date >= start & date < start + duration, acres_per_day))
  #add this fire's daily acres to the running total
  full_calender$acres_burnt <- full_calender$acres_burnt + new_cal$acres_burnt
}
#prepare data for plotting
full_calender$dayOfYear <- as.numeric(format(full_calender$date, "%j"))
full_calender$CommonDate <- as.Date(paste0("2020-",format(full_calender$date, "%j")), "%Y-%j")
full_calender <- full_calender %>% separate(date, c("year"), sep = "-", remove = FALSE)
#create a copy of the dataframe
full_calender2 <- full_calender
full_calender2$year <- as.numeric(full_calender2$year)
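mul <- 1
#prepare data for geom_ridgeline plot: shift each year's baseline up by 26000 per year before 2017, so 2017 sits at the bottom and 2000 at the top
for(i in 2016:2000){
  full_calender2$year[full_calender2$year == i] <- full_calender2$year[full_calender2$year == i] + 26000*mul
  mul <- mul + 1
}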
#plot data to geom_ridgeline
p2 <- ggplot(full_calender2, aes(x = CommonDate, y = year, height = acres_burnt, group = year)) +
#draw grey rectangles behind alternating months
geom_rect(aes(xmin = as.Date("2020-01-01"),xmax = as.Date("2020-01-31"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
geom_rect(aes(xmin = as.Date("2020-03-01"),xmax = as.Date("2020-03-31"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
geom_rect(aes(xmin = as.Date("2020-05-01"),xmax = as.Date("2020-05-31"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
geom_rect(aes(xmin = as.Date("2020-07-01"),xmax = as.Date("2020-07-31"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
geom_rect(aes(xmin = as.Date("2020-09-01"),xmax = as.Date("2020-09-30"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
geom_rect(aes(xmin = as.Date("2020-11-01"),xmax = as.Date("2020-11-30"),ymin = -Inf, ymax = Inf ), fill = "#dadada") +
#draw scale
annotate("text", x = as.Date("2020-12-18"), y =2700+5000, label = "- 5k acres", colour = "#b22121", size = 2, hjust = 0) +
annotate("text", x = as.Date("2020-12-18"), y =2700+10000, label = "- 10k acres", colour = "#b22121", size = 2, hjust = 0) +
annotate("text", x = as.Date("2020-12-18"), y =2700+15000, label = "- 15k acres", colour = "#b22121", size = 2, hjust = 0) +
annotate("text", x = as.Date("2020-12-18"), y =2700+20000, label = "- 20k acres", colour = "#b22121", size = 2, hjust = 0) +
#draw area for each year to represent acres burnt
geom_ridgeline(fill = "#ff4500", alpha=0.4, size = 0.1, color = "#b22121") +
theme(panel.background = element_rect(fill = 'white', colour = 'white')) +
scale_x_date(labels = function(x) format(x, "%b"), breaks = "1 month")+
ylab("") + xlab("") +
ggtitle("California wildfires in Acres, by year") +
theme(plot.title = element_text(size=12, hjust = 0.5)) +
#change scale values to show years
scale_y_continuous(breaks = seq(0, 442000, by = 26000), labels = as.character(2017:2000))
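The per-fire loop above spreads each fire's acres evenly over its duration and then sums the fires that overlap on the same day. The following is a minimal base R sketch of that idea on made-up data. The two fires, their dates and their sizes are hypothetical and do not come from the CAL FIRE dataset, and the object names `fires`, `daily` and `daily_total` are introduced here only for illustration.

```r
# Toy data: two hypothetical fires, NOT taken from the CAL FIRE dataset
fires <- data.frame(
  start = as.Date(c("2008-08-01", "2008-08-03")),
  end   = as.Date(c("2008-08-04", "2008-08-03")),
  acres = c(4000, 500)
)

# Duration counts both the first and the last day, as in the code above
fires$duration <- as.numeric(fires$end - fires$start) + 1
fires$acres_per_day <- fires$acres / fires$duration

# Expand each fire into one row per burning day
daily <- do.call(rbind, lapply(seq_len(nrow(fires)), function(i) {
  data.frame(
    date = seq(fires$start[i], fires$end[i], by = "day"),
    acres_burnt = fires$acres_per_day[i]
  )
}))

# Sum the daily shares of all fires burning on the same day
daily_total <- aggregate(acres_burnt ~ date, data = daily, FUN = sum)
print(daily_total)
```

In this toy example the two fires overlap on 3 August, so that day's total is the sum of both daily shares. This is the quantity the ridgeline heights are meant to show: small fires add to the total instead of being hidden behind larger ones.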
Data Reference
CAL FIRE. (2020). California Fire [Original dataset]. https://www.fire.ca.gov/imapdata/mapdataall.csv
Gamio, L. / Axios. (2017). California Fire [Compiled dataset]. https://gist.github.com/lazarogamio/d64e0d04b1ce1f2a3bd08db7526fa632
The following plot fixes the main issues in the original.