California Wildfire Analysis

Data Cleaning:

All rows in active column are False because none of them are active anymore. This information is not particularly useful so Benjamin removed this column. The canonical url column provides the ends of a url, but no mention of the website being references. There is not any real value so Benjamin removed the canonical url column. County id is not particularly helpful since longitude and latitude values are already present. Benjamin removed the County id column. The column featured is mostly false and the column final is mostly true. Fuel type is mostly null. Benjamin removed featured, final and fuel type columns together. The percent contained column is all 100’s since all fires reported have been contained. Benjamin removed the percent contained column. The public data column has all TRUE values, so Benjamin removed it. The status column all finalized value so Benjamin removed it. All of the values in StructuresEvacuated were NA so Benjamin removed it. The UniqueId column makes no sense so Benjamin removed it. The update column provides no useful dates like started and extinguished so Benjamin removed it.

library(readr)
fire_dataset_Cleaning <- read_csv("California_Fire_Incidents.csv")

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 1636 Columns: 40

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): AdminUnit, CanonicalUrl, ConditionStatement, ControlStatement, Co...
## dbl  (17): AcresBurned, AirTankers, ArchiveYear, CrewsInvolved, Dozers, Engi...
## lgl   (8): Active, CalFireIncident, Featured, Final, FuelType, MajorIncident...
## dttm  (3): Extinguished, Started, Updated

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#View(fire_dataset_Cleaning)

fire_dataset_Cleaning <- fire_dataset_Cleaning[,-2]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-6]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-9]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-14:-16]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-21]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-22]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-25]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-27]
fire_dataset_Cleaning <- fire_dataset_Cleaning[,-28:-29]

Benjamin moved the started column to be in front of the established column for clarity. Benjamin moved the name column to be the first column of the dataset for clarity. Benjamin moved the Longitude after the Latitude for clarity.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

fire_dataset_Cleaning <- fire_dataset_Cleaning %>% relocate(Started, .before = Extinguished)
fire_dataset_Cleaning <- fire_dataset_Cleaning %>% relocate(Name, .before = AcresBurned)
fire_dataset_Cleaning <- fire_dataset_Cleaning %>% relocate(Longitude, .before = Location)

In many of the columns, there were both NA values and numeric values. Benjamin decided to replace NA values with 0 since it is unclear what the true value of these observations might be. Please consider airtankers, crewsinvolved, dozers, engines, fatalities, helicopters, injuries, personnel involved, structures damaged, structures destroyed, structures threatened and water tenders with care.

library(tidyr)
summary(fire_dataset_Cleaning$Started)

##                  Min.               1st Qu.                Median 
## "1969-12-31 16:00:00" "2015-08-25 07:39:00" "2017-07-18 14:06:30" 
##                  Mean               3rd Qu.                  Max. 
## "2017-02-12 08:19:16" "2018-07-19 13:56:30" "2019-11-25 19:59:12"

min(fire_dataset_Cleaning$Started)

## [1] "1969-12-31 16:00:00 UTC"

min(fire_dataset_Cleaning$Extinguished)

## [1] NA

summary(fire_dataset_Cleaning$Extinguished)

##                  Min.               1st Qu.                Median 
## "1969-12-31 16:00:00" "2015-08-22 18:00:00" "2018-01-09 11:52:00" 
##                  Mean               3rd Qu.                  Max. 
## "2017-04-03 09:40:55" "2019-01-04 09:33:00" "2019-12-14 08:22:00" 
##                  NA's 
##                  "59"

California Wildfire Shiny Web App:

The following section is intended to interactively visualize the various variables available. The variables that Alejandro wants to focus on are the AcresBurned throughout the four seasons of the year. With the help of module 11’s material, the process of developing and deploying the shiny web app was possible with the use of the shiny and rsconnect packages.

# libraries
#install.packages(c('shiny','shinythemes','thematic','tidyverse','lubridate','rsconnect'))
library(shiny)
library(shinythemes)
library(thematic)
library(tidyverse)
library(lubridate)
library(rsconnect)

Below, Alejandro has defined a UI with two separate tabs within the panel:

The scatter plot will display the result of choosing between three variables, the number of acres burned, fatalities, and injuries. These variables will be options for the x-axis, y-axis, and dot size of the plot.
The bar graph will instead visualize any trends in the number of acres burned, along with other observations, per each year. These will be separated by the season in which they occurred.

# Define UI
ui <- fluidPage(
  theme = shinytheme("yeti"),
  titlePanel("California Forest Fire Analysis"),
  # set up two tab panels
  tabsetPanel(
    tabPanel("Scatter Plot", fluid = TRUE,
             sidebarLayout(
               # sidebar inputs for scatterplot
               sidebarPanel(
                 selectInput("var", "X-axis:",
                             c("Acres Burned" = "AcresBurned",
                               "# of Fatalities" = "Fatalities",
                               "# Injured" = "Injuries"),
                             selected = "AcresBurned"),
                 selectInput("var1", "Y-axis:",
                             c("Acres Burned" = "AcresBurned",
                               "# of Fatalities" = "Fatalities",
                               "# Injured" = "Injuries"),
                             selected = "Injuries"),
                 selectInput("var2", "Dot Size:",
                             c("Acres Burned" = "AcresBurned",
                               "# of Fatalities" = "Fatalities",
                               "# Injured" = "Injuries"),
                             selected = "Fatalities")
              ),
              mainPanel(
                # display scatter plot in main panel
                plotOutput("scatterplot"))
              )
    ),
    tabPanel("Bar Graph", fluid = TRUE,
             sidebarLayout(
               # sidebar inputs for the bar graph
               sidebarPanel(
                 selectInput("var3", "X-axis:",
                             c("Month" = "Month",
                               "Year" = "Year"),
                             selected = "Year"),
                 selectInput("var4", "Y-axis:",
                             c("Acres Burned" = "AcresBurned",
                               "# of Fatalities" = "Fatalities",
                               "# Injured" = "Injuries",
                               "# of Structures Damaged" = "StructuresDamaged",
                               "# of Structures Destroyed" = "StructuresDestroyed"),
                             selected = "AcresBurned")),
               mainPanel(
                 plotOutput("bargraph"))
            )
    )
  )
)

Inside the server function, Alejandro also included some extra wrangling to the data to fit his needs. After this, he proceed to create the scatter plot and bar graph visualizations displaying the inputs available for selection.

# server logic containing data prep and plotting for the app
server <- function(input, output) {
  # run Ben's data cleaning script plus further cleaning for my needs
  fire_df <- cleanFireData() %>%
    select(AcresBurned, CalFireIncident, Started, Extinguished, 
           Fatalities, Injuries, StructuresDamaged, StructuresDestroyed) %>%
    mutate(AcresBurned = as.numeric(AcresBurned),
           Started = as.Date(Started, "%Y-%m-%d"),
           Extinguished = as.Date(Extinguished, "%Y-%m-%d")) %>%
    mutate(Month = month(Started, label=TRUE), Year = year(Started)) %>%
    filter(Year != 1969) %>% # remove Year outliers found in data set
    select(-c(Started, Extinguished))
  
  # convert 0 and NAs to the mean of AcresBurned  
  fire_df$AcresBurned[fire_df$AcresBurned == 0] <- NA
  fire_df$AcresBurned[is.na(fire_df$AcresBurned)] <- mean(fire_df$AcresBurned, 
                                                          na.rm = TRUE)
  # create season variable for scatter plot color use joined with original data frame
  fire_df$Month <- as.numeric(fire_df$Month)
  
  # setting month range by seasons
  spring <- c(03,04,05)
  fall <- c(09,10,11)
  summer <- c(06,07,08)
  winter <- c(12,01,02)
  # seasons df to join with existing df
  seasons <- data.frame(
    Season = c("spring","spring","spring", "fall","fall","fall", 
               "summer","summer","summer", "winter","winter","winter"),
    Month = c(spring, fall, summer, winter)
  )
  # joining seasons by month
  fire_df <- fire_df %>%
    full_join(seasons, fire_df, by = "Month") %>%
    # setting variables to desired format
    mutate(Month = month(Month, label=TRUE),
           Season = as.factor(Season))
  
  # scatter plot showing the desired inputs
  output$scatterplot <- renderPlot(
    ggplot(fire_df, aes(x=fire_df[,input$var], y=fire_df[,input$var1])) +
      geom_point(aes(color=Season, size=fire_df[,input$var2])) +
      labs(title=paste(input$var1, "vs.", input$var),
           x=paste(input$var), y=paste(input$var1), 
           size=paste(input$var2)) +
      scale_x_continuous(labels = scales::comma)
  )
  
  # bar graph of the desired inputs
  output$bargraph <- renderPlot(
    ggplot(fire_df, aes(x=as.factor(fire_df[,input$var3]), 
                        y=fire_df[,input$var4], fill=Season)) +
      geom_bar(position="dodge", stat="identity", width=0.7) +
      labs(title=paste(input$var4, "by", input$var3),
           x=paste(input$var3), y=paste(input$var4)) +
      scale_y_continuous(labels = scales::comma)
  )
}

# set plot theme to minimal
theme_set(theme_minimal())
# have the shiny theme match the ggplot theme
thematic_shiny()

To view the app, click this link.

The shiny app shows that there may not be a correlation with the season how intense a California wildfire can get. Although there is a noticeable trend during the summer season throughout the years 2013-2019, the wildfires during the fall season in 2017 says otherwise.

Training a Model for Correlation:

#Using my data:
library(dplyr)
library(e1071)
library(kernlab)
library(ggplot2)
library(Metrics)
#Cleaning the data to only include the columns I need
data <- read.csv("California_Fire_incidents.csv")
data <- data %>%
        select(AcresBurned, ArchiveYear, CalFireIncident, Started, MajorIncident)
data$CalFireIncident <- as.factor(data$CalFireIncident)


#Removing the the year and day from the date so that I can only use month
data$month <- strftime(data$Started,"%m")
data$day <- strftime(data$Started, "%j")
data$month <- as.numeric(data$month)
data$day <- as.numeric(data$day)

#season column
data$seasons <- as.numeric(data$month)

#Creating a separate df for seasons
spring <- c(03, 04, 05)
fall <- c(09, 10, 11)
summer <- c(06, 07, 08)
winter <- c(12, 01, 02)

seasons <- data.frame(
  seasons = c("spring","spring","spring", "fall","fall","fall", "summer","summer","summer", "winter","winter","winter"),
  month = c(spring, fall, summer, winter)
)

data <- data %>%
        full_join(seasons, data, by = "month")

Upon looking at the original data frame, wanting to see if there were a correlation between the seasons and the amount of acres burned. In theory dry conditions in the summer would cause more fires due to the intense heat that California gets during the summer.

Using Ben’s initial cleaning of the original data set, more cleaning was needed in order for Brandon to obtain the seasons for the months:

#Creating final dataset
data_fin <- data %>%
            select(AcresBurned, ArchiveYear, CalFireIncident, "season" = seasons.y, month, day, Started, MajorIncident) %>%
            filter(CalFireIncident == "True")
data_fin$Started <- as.Date(data_fin$Started)
data_fin$Day <- strftime(data_fin$Started,"%m-%d")
data_fin$MajorIncident <- as.factor(data_fin$MajorIncident)
data_fin$AcresBurned <- as.numeric(data_fin$AcresBurned)
data_fin$season <- as.factor(data_fin$season)
data_fin$month <- as.factor(data_fin$month)
data_fin$AcresBurned[data_fin$AcresBurned == 0] <- NA
data_fin$AcresBurned[is.na(data_fin$AcresBurned)] <- mean(data$AcresBurned, na.rm = TRUE)
head(data_fin)

##   AcresBurned ArchiveYear CalFireIncident season month day    Started
## 1      257314        2013            True summer     8 229 2013-08-17
## 2       30274        2013            True spring     5 150 2013-05-30
## 3       27531        2013            True summer     7 196 2013-07-15
## 4       24251        2013            True spring     5 122 2013-05-02
## 5       20292        2013            True summer     8 219 2013-08-07
## 6       11429        2013            True summer     8 235 2013-08-23
##   MajorIncident   Day
## 1         False 08-17
## 2         False 05-30
## 3         False 07-15
## 4          True 05-02
## 5          True 08-07
## 6          True 08-23

By adding the seasons, Brandon’s correlational analysis would be easier, due to his predictive model only predicting the four seasons, rather than the 12 months. In theory, Brandon’s predictive model using a support vector machine, if his predictive model could accurately predict the seasons using the Acres Burned, whether the fire was considered major, and if it was reported by the California wildfire team, an accurate predictive model would mean that the size of the fire has a correlation between the time of year, and in our case which season of the year.

Using ggplot, we could find if there are any correlations to be made:

ggplot(data_fin, aes(x = Day, 
                     y = AcresBurned, 
                     color = season)) +
                geom_point() +
                geom_jitter() +
                theme(axis.text.x=element_blank()) + 
                ylim(0, 5000) +
                labs(x = "Entire Year in Days")

## Warning: Removed 83 rows containing missing values (geom_point).

## Warning: Removed 83 rows containing missing values (geom_point).

Considering that the days are chronological in order due to the year, the plot will be shown accordingly. With the division of the plot with the different colors showing the different seasons, there is not potential for a linear model to show a prediction, however a predictive model using classification would have a better fit considering that the column we are predicting is a factor with 4 levels: Fall, Winter, Spring, and Summer.

Using a random sample that we learned in module 10, we can create a test and training set which we will use for our analysis and model:

##    AcresBurned season MajorIncident
## 1       257314 summer         False
## 3        27531 summer         False
## 4        24251 spring          True
## 7         8073   fall          True
## 12        3505   fall          True
## 14        3111   fall          True

##     AcresBurned season MajorIncident
## 415          83 summer          True
## 463          28 summer          True
## 179         101 summer         False
## 526        1044 summer          True
## 195          57 spring         False
## 938          75 summer          True

Going back to Brandon’s analysis, the use of a support vector machine would be used to see if there is a significance of the stated vectors in order for the algorithm to accurately predict the season, if the model could predict the seasons with high accuracy then there would be a significance in season to the size of the fire.

Using the svm() function from the e1071 library, we could use the smv algorithm to predict the seasons using classification.

#I want to see if my model could predict the season using the variables
svm_model <- svm(season ~., data = train)
svm_model

## 
## Call:
## svm(formula = season ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  586

## [1] "Model's Accuracy: 64.1%"

Considering that the model’s accuracy is ~64%, and the predicted output only gives out summer, the low accuracy shows that the only correlation the model can will outfit the summer factor, meaning that there is no correlation between the seasons and size of fire.

Brandon Dora’s Part:

### Brandon Dora's Density plot of the Fires per County in California. 


### Importing Library 
library(readr)
library(ggmap)

## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.

## Please cite ggmap if you use it! See citation("ggmap") for details.

library(ggplot2)
library(tibble)
library(tidyverse)
library(zipcode)
library(remotes)

Density Plot and State Map showing fire occurences in the State of California.

For the portion below, Brandon Dora had decided to create a density map similar to the ones that we had created prior in week 4 of Introduction to Data Mining.

#implementing Ben's cleaned dataset using alejandro's inner portion of his function
 testFrame3 <- read.csv("California_Fire_Incidents.csv")
  
  testFrame3 <- testFrame3[,-2]
  
  testFrame3 <- testFrame3[,-6]
  
  testFrame3 <- testFrame3[,-9]
  
  testFrame3 <- testFrame3[,-14:-16]
  
  testFrame3 <- testFrame3[,-21]
 
  testFrame3 <- testFrame3[,-22]

  testFrame3 <- testFrame3[,-25]
  
  summary(testFrame3$StructuresEvacuated)

##    Mode    NA's 
## logical    1636

  testFrame3 <- testFrame3[,-27]
  
  summary(testFrame3$StructuresThreatened)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    45.0   522.8  1043.8  2600.0    1606

  testFrame3 <- testFrame3[,-28:-29]
  
  testFrame3 <- testFrame3 %>% relocate(Started, .before = Extinguished)
  
  testFrame3 <- testFrame3 %>% relocate(Name, .before = AcresBurned)
  
  testFrame3 <- testFrame3 %>% relocate(Longitude, .before = Location)
  
  testFrame3 <- replace_na(testFrame3, list(AirTankers = 0, CrewsInvolved = 0, Dozers = 0, Engines = 0, Fatalities = 0, Helicopters = 0, Injuries = 0, PersonnelInvolved = 0, StructuresDamaged = 0, StructuresDestroyed = 0, StructuresThreatened = 0, WaterTenders = 0))


## creating and cleaning data frames to map out the state of california
  
states <- map_data("state")
California <- subset(states, region == "california")
California$Longitude <- California$long
California$Latitude <- California$lat
California$long = NULL
California$lat = NULL
California$region <- str_to_title(California$region)
counties <- map_data("county")
californiaCounties <- subset(counties, region == "california")
californiaCounties$Counties <- str_to_title(californiaCounties$subregion)
californiaCounties$subregion = NULL
californiaCounties$Longitude <- californiaCounties$long
californiaCounties$Latitude <- californiaCounties$lat
californiaCounties$long = NULL
californiaCounties$lat = NULL

### creating a new dataframe based on our original to focus on the counties, and the number of times a fire has occured in each of them.

The reasoning behind creating a new dataframe for this, was to be able to seamlessly consolidate the name’s of each county in California, while also being able to log the number of times a fire had occured in each county. This way, we are able to join our dataframe, with the newly created “californiaCounties” dataframe listed above. There are many other ways to do this, but for the method he had chosen it was simply because it made for a much more seamless process, and turned out to be more time effective.

testFrame2 <- table(testFrame3$Counties)
testFrame2

## 
##         Alameda          Alpine          Amador           Butte       Calaveras 
##              32               2              13              66              22 
##          Colusa    Contra Costa       Del Norte       El Dorado          Fresno 
##               6              27               6              37              57 
##           Glenn        Humboldt            Inyo            Kern           Kings 
##               8              25              12              62               5 
##            Lake          Lassen     Los Angeles          Madera           Marin 
##              49              36              46              36               6 
##        Mariposa       Mendocino          Merced          Mexico           Modoc 
##              35              28              16               2              31 
##            Mono        Monterey            Napa          Nevada          Orange 
##              10              45              25              17              10 
##          Placer          Plumas       Riverside      Sacramento      San Benito 
##              17              11             146              11              20 
##  San Bernardino       San Diego     San Joaquin San Luis Obispo       San Mateo 
##              53              89               8              64               3 
##   Santa Barbara     Santa Clara      Santa Cruz          Shasta          Sierra 
##              29              39               4              64               2 
##        Siskiyou          Solano          Sonoma      Stanislaus State of Nevada 
##              57              19              18              20               1 
## State of Oregon          Sutter          Tehama         Trinity          Tulare 
##               1               3              51              21              35 
##        Tuolumne         Ventura            Yolo            Yuba 
##              22              30              12              14

testFrame2 <- as.matrix(testFrame2)
testFrame2 <- as.data.frame(testFrame2)
testFrame2$Counties <- c("Alameda", "Alpine", "Amador", "Butte", "Calaveras", "Colusa", "Contra Costa", "Del Norte", "El Dorado", "Fresno", "Glenn", "Humboldt", "Inyo", "Kern", "Kings", "Lake", "Lassen", "Los Angeles", "Madera", "Marin", "Mariposa", "Mendocino", "Merced", "Mexico", "Modoc", "Mono", "Monterey", "Napa", "Nevada", "Orange", "Placer", "Plumas", "Riverside", "Sacramento", "San Benito", "San Bernardino", "San Diego", "San Joaquin", "San Luis Obispo", "San Mateo", "Santa Barbara", "Santa Clara", "Santa Cruz", "Shasta", "Sierra", "Siskiyou", "Solano", "Sonoma", "Stanislaus", "State of Nevada", "State of Oregon", "Sutter", "Tehama", "Trinity", "Tulare", "Tuolumne", "Ventura", "Yolo", "Yuba")
testFrame2$FireCount <- testFrame2$V1
testFrame2$V1 = NULL


### combining our new test frame with our county dataframe. 
californiaCounties <- left_join(californiaCounties, testFrame2, by="Counties")
californiaCounties[is.na(californiaCounties)] = 0

As shown above, joining the two dataframes together allowed for us to easily combine the data we needed, while not having any complications whatsoever.

# creating our state map
caliFire <- ggplot(data = California, mapping = aes(x = Longitude, y = Latitude, group = group))
caliFire <- caliFire + geom_polygon(data = californiaCounties, color = "white") + geom_polygon(color = "black", fill = NA)
caliFire <- caliFire + coord_fixed(1.3)

### the state map sectioned by counties
caliFire

The reasoning behind plotting a map displaying counties first, was to get a good idea of what we are working with, while also creating a base that we will be placing our density map on top of.

### adding our data and creating our 2d density map on top our previously created map.


caliFire <- caliFire + geom_polygon(data = californiaCounties, aes(fill = FireCount))
caliFire<- caliFire + ggtitle("Fire Count in California by County")
caliFire2d <- caliFire + stat_density2d()
caliFire2d

As you can see, our density map is properly displaying our Two-Dimensional density map, where we can easily see by county where more often than not fires had occurred.

Conclusions

The wildfire analysis gave many insights into looking at what was going on in California from 2013-2019.

Alejandro was able to utilize the data in order to make an interactive dashboard for his collection, the dashboard can be found here. Using Alejandro’s dashboard we were able to have an insight as to the cummulative acres burned per each year as seen below:

This bar graph that is also on Alejandro’s dashboard showed insight to the size of the fires per each year with emphasis on seasons. As we can see here it is really difficult to determine any patterns or trends among the seasons per each year, which gives an idea that maybe California’s weather has no direct effect on the time of year or season.

By figuring this out, we can transition to Brandon Ly’s analysis in finding if there is a correlation between the season and acres burned.

## Warning: Removed 83 rows containing missing values (geom_point).

## Warning: Removed 83 rows containing missing values (geom_point).

Alike Alejandro’s bar plot, Brandon’s graph which plots the days of the year along the points showing the size of the fire has a insignificant outlook alike Alejandro’s bar plot. The points are evenly scattered which point the direction that climate and season had no effect on the size of a wildfire in California.

Through the use of support vector machines, Brandon found that his model could only predict the outcomes of the seasons with a 64%, meaning that there is not a correlation between the two. By using Brandon Ly’s and Alejandro’s visualization it makes sense that Brandon Ly’s model could not accuretly predict the season from the acres burned because there is no correlation between the two.

We can take a further look in the details of the fires using Brandon Dora’s visualization.

Using Brandon Dora’s density map, it shows that the most fires that occurred in California were scattered among the bay area, and down towards the west coast of California. Again, with the knowledge from Brandon Ly’s and Alejandro’s findings it makes sense that the season has no effect on the size of fires, because most of the fires occurred in the spots where the weather remains consistent.

This shows that wildfires in California occur simply due to the natural climate California faces each year, and that seasons or time of years do not have a direct play in how the fires start and occur.