In this project, I analyzed forest fires data. The goals of this project were to :
The first thing is to load the tidyverse library.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s load and view the data.
forestfires <- read_csv("forestfires.csv")
## Parsed with column specification:
## cols(
## X = col_double(),
## Y = col_double(),
## month = col_character(),
## day = col_character(),
## FFMC = col_double(),
## DMC = col_double(),
## DC = col_double(),
## ISI = col_double(),
## temp = col_double(),
## RH = col_double(),
## wind = col_double(),
## rain = col_double(),
## area = col_double()
## )
View(forestfires)
I want to know when forest fires are most likely to occur so I asked the following questions:
Used mutate() to change the data type of the month and day columns. Used factor to specify a certain order of variables. This will list the months and days in chronological order on my charts.
forestfiresnew <- forestfires %>%
mutate(month = factor(month, levels = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")), day = factor(day, levels = c("sun", "mon", "tue", "wed", "thu", "fri", "sat")))
Split the data into group by month and apply the n() function to count the number of observations in each group.
fires_by_month<-forestfiresnew %>%
group_by(month) %>%
summarize(total_fires = n())
Create bar chart of fires by month.
ggplot(data = fires_by_month) +
aes(x = month, y = total_fires) +
geom_bar(stat = "identity", fill = "tomato") +
theme(panel.background = element_rect(fill = "white")) +
labs(title = "Fires by Month", x = "Month", y="Total Fires")
August and September are the months in which forest fires occur the most.
Split the data into group by day and apply the n() function to count the number of observations in each group.
fires_by_day<-forestfiresnew %>%
group_by(day) %>%
summarize(total_fires = n())
Create bar chart of fires by days of the week.
ggplot(data = fires_by_day) +
aes(x = day, y = total_fires) +
geom_bar(stat = "identity", fill = "tan1") +
theme(panel.background = element_rect(fill = "white")) +
labs(title = "Fires by Day of the Week", x = "Day of Week", y= "Total Fires")
Sundays seem to have the most fires, while Wednesdays seem to have the least fires.
Create multiple box plots to visualize distribution of variables by month. The variables are FFMC, DMC, DC, ISI, temp, RH, wind, and rain. You can learn more about the variables here: http://www3.dsi.uminho.pt/pcortez/fires.pdf
fire_month_plot = function(x,y) {
ggplot(data = forestfiresnew) +
aes_string(x = x, y = y) +
geom_boxplot(col = "chocolate4") +
theme(panel.background = element_rect(fill="white")) +
labs(x = "month")
}
x_var <- names(forestfiresnew)[3]
y_var <-names(forestfiresnew)[5:12]
map2(x_var, y_var, fire_month_plot)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
Many of the variables display clear differences among months. The temp and DC are high during summer months.
Create multiple box plots to visualize distribution of variables by day of the week. The variables are FFMC, DMC, DC, ISI, temp, RH, wind, and rain.
fire_day_plot = function(x,y) {
ggplot(data = forestfiresnew) +
aes_string(x = x, y = y) +
geom_boxplot(col = "lightseagreen") +
theme(panel.background = element_rect(fill="white")) +
labs(x = "day")
}
x_var <- names(forestfiresnew)[4]
y_var <-names(forestfiresnew)[5:12]
map2(x_var, y_var, fire_day_plot)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
It looks like the median of each of the variables are consistent across the days of the week.
Create scatter plots to see what can be learned about relationships between the area burned by a forest fire and the following variables.
fire_month_scatter = function(x,y) {
ggplot(data = forestfiresnew) +
aes_string(x = x, y = y) +
geom_point(alpha = 0.5, col="aquamarine4") +
theme(panel.background = element_rect(fill="white"))
}
x_var <- names(forestfiresnew)[5:12]
y_var <- names(forestfiresnew)[13]
map2(x_var, y_var, fire_month_scatter)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
Most of the points are clustered at the bottom. The points representing area are either zero or close to zero. This tells me that the correlation between these variables and forest fire area does not seem to be strong.
I wanted to look at the distribution of the ISI and the FFMC variables. ISI refers to the expected rate of fire spread. FFMC refers to relative wase of ignition and the flammability of fine fuel.
ggplot(data = forestfiresnew) +
aes(x = ISI) +
geom_histogram(binwidth = 25, fill="darkslateblue") +
theme(panel.background = element_rect(fill="white"))
ggplot(data = forestfiresnew) +
aes(x = FFMC) +
geom_histogram(binwidth = 25, fill = "deeppink4") +
theme(panel.background = element_rect(fill="white"))
The ISI histogram shows me that most of the values of the ISI variable are between zero and about thirty.
The FFMC histogram shows me that most of the values of the FFMC variable are between sixty and one-hundred.
Conclusion:
Through this project, I’ve learned that forest fires can be caused by a variety of factors. More exploration of the data needs to occur to really determine exactly which variables have more influence in forest fires.