My project examines whether forest fires occur equally often in all Brazilian states. There are 6,454 observations in the data set. Each case records the number of forest fires reported in one state for one month of one year.
This data set reports the number of forest fires in Brazil by state. The series covers a period of approximately 20 years (1998 to 2017). The data were obtained from the official website of the Brazilian government.
H0: the mean number of forest fires is the same in every state
H1: the mean number of forest fires differs in at least one state
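In symbols, writing $\mu_i$ for the mean of `number` in state $i$ (this notation is introduced here only for illustration, and 23 is the number of states in the data), the two hypotheses can be sketched as:

$$
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_{23}
\qquad\text{versus}\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
$$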
This is an observational study. The response variable is the number of forest fires, which is numerical. The explanatory variable is the state, which is categorical.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(stringr)
forest_fire<-read.csv("https://raw.githubusercontent.com/Sizzlo/Data-Project-Proposal/master/amazon.csv", sep=",")
dim(forest_fire)
## [1] 6454 5
names(forest_fire)
## [1] "year" "state" "month" "number" "date"
summary(forest_fire)
## year state month number
## Min. :1998 Rio : 717 Janeiro : 541 Min. : 0.0
## 1st Qu.:2002 Mato Grosso: 478 Abril : 540 1st Qu.: 3.0
## Median :2007 Paraiba : 478 Agosto : 540 Median : 24.0
## Mean :2007 Alagoas : 240 Fevereiro: 540 Mean :108.3
## 3rd Qu.:2012 Acre : 239 Julho : 540 3rd Qu.:113.0
## Max. :2017 Amapa : 239 Junho : 540 Max. :998.0
## (Other) :4063 (Other) :3213
## date
## 1998-01-01: 324
## 1999-01-01: 324
## 2000-01-01: 324
## 2001-01-01: 324
## 2002-01-01: 324
## 2003-01-01: 324
## (Other) :4510
The code below groups the records by state and sums the number of forest fires in each, then sorts the totals from highest to lowest. I want to look at the 10 states with the most forest fires. Knowing which areas are most prone to forest fires lets us be more alert there. We can see that Mato Grosso has the highest total.
fire1<-forest_fire %>%
group_by(state) %>%
summarise(Total=round(sum(number))) %>%
arrange(desc(Total))
fire1_top10<-fire1 %>%
slice(1:10)
fire1_top10
## # A tibble: 10 x 2
## state Total
## <fct> <dbl>
## 1 Mato Grosso 96246
## 2 Paraiba 52436
## 3 Sao Paulo 51121
## 4 Rio 45161
## 5 Bahia 44746
## 6 Piau 37804
## 7 Goias 37696
## 8 Minas Gerais 37475
## 9 Tocantins 33708
## 10 Amazonas 30650
ggplot(fire1_top10, aes(x=state, y=Total))+geom_bar(fill="lightblue", stat="identity")
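ggplot2 orders a discrete axis by factor level (alphabetical here), so the bars in the plot above do not appear in order of Total. A minimal sketch of one way to sort them, assuming the top-10 totals printed above are typed in by hand as a stand-in data frame (`top10` is a hypothetical name, not an object from the analysis):

```r
library(ggplot2)

# Stand-in for fire1_top10, copied from the printed totals (hypothetical object)
top10 <- data.frame(
  state = c("Mato Grosso", "Paraiba", "Sao Paulo", "Rio", "Bahia",
            "Piau", "Goias", "Minas Gerais", "Tocantins", "Amazonas"),
  Total = c(96246, 52436, 51121, 45161, 44746,
            37804, 37696, 37475, 33708, 30650)
)

# reorder() sets the factor levels of state by ascending Total,
# so after coord_flip() the largest total is drawn at the top
p <- ggplot(top10, aes(x = reorder(state, Total), y = Total)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(x = "State", y = "Total forest fires")
p
```

Flipping the coordinates also keeps the state names readable, which is the same choice made later for the plot of means.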
We can also total the forest fires by year to see which years had the most. We can see that 2003 had the most forest fires. The table lists the five years with the highest totals.
fire2<-forest_fire %>%
group_by(year) %>%
summarise(Total=round(sum(number))) %>%
arrange(desc(Total)) %>%
slice(1:5)
kable(head(fire2))
| year | Total |
|---|---|
| 2003 | 42761 |
| 2016 | 42212 |
| 2015 | 41208 |
| 2012 | 40085 |
| 2014 | 39621 |
ggplot(fire2, aes(x=year, y=Total))+geom_point()
Tidy the data to keep only the state and the number of forest fires.
test1<-forest_fire %>%
group_by(state) %>%
select(state, number)
test1
## # A tibble: 6,454 x 2
## # Groups: state [23]
## state number
## <fct> <dbl>
## 1 Acre 0
## 2 Acre 0
## 3 Acre 0
## 4 Acre 0
## 5 Acre 0
## 6 Acre 10
## 7 Acre 0
## 8 Acre 12
## 9 Acre 4
## 10 Acre 0
## # ... with 6,444 more rows
Compute the mean number of forest fires per record for each state. The result column keeps the name Total, but here it holds the rounded mean.
meanfire1<-forest_fire %>%
group_by(state) %>%
summarise(Total=round(mean(number))) %>%
arrange(desc(Total))
meanfire1
## # A tibble: 23 x 2
## state Total
## <fct> <dbl>
## 1 Sao Paulo 214
## 2 Mato Grosso 201
## 3 Bahia 187
## 4 Goias 158
## 5 Piau 158
## 6 Minas Gerais 157
## 7 Tocantins 141
## 8 Amazonas 128
## 9 Ceara 127
## 10 Paraiba 110
## # ... with 13 more rows
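The ranking by mean differs from the ranking by total because the states do not all have the same number of records: the summary() output lists 717 rows for Rio and 478 each for Mato Grosso and Paraiba. As a rough check, assuming those row counts and the rounded totals printed earlier (typed in by hand here, with hypothetical vector names), dividing each total by its record count reproduces the means:

```r
# Totals from the by-state table and row counts from summary(forest_fire)
totals  <- c("Mato Grosso" = 96246, "Paraiba" = 52436, "Rio" = 45161)
records <- c("Mato Grosso" = 478,   "Paraiba" = 478,   "Rio" = 717)

# Mean fires per record = total / number of records
approx_mean <- round(totals / records)
approx_mean
```

Mato Grosso and Paraiba match the 201 and 110 in the table above. Rio comes out near 63, which would explain why it ranks fourth by total but is not among the top ten by mean.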
This graph shows the mean number of forest fires for each state.
ggplot(meanfire1, aes(x=state, y=Total))+geom_bar(fill="lightblue", stat="identity")+ labs(y="Average Total") + coord_flip()
ggplot(test1, aes(x=number, y=state))+geom_point()
fire_lm=lm(number~state, data= test1)
anova(fire_lm)
## Analysis of Variance Table
##
## Response: number
## Df Sum Sq Mean Sq F value Pr(>F)
## state 22 20105714 913896 27.356 < 2.2e-16 ***
## Residuals 6431 214843574 33407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we reject the null hypothesis that the mean number of forest fires is the same in every state. We conclude that at least one state's mean differs from the others.
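The F value in the ANOVA table can be reproduced from its other columns: each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. A small sketch, assuming the numbers printed by anova(fire_lm) are typed in by hand (the variable names are hypothetical):

```r
# Values copied from the ANOVA table
ss_state <- 20105714;  df_state <- 22
ss_resid <- 214843574; df_resid <- 6431

ms_state <- ss_state / df_state   # mean square between states
ms_resid <- ss_resid / df_resid   # mean square within states (residual)
f_value  <- ms_state / ms_resid   # F statistic

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)
round(f_value, 3)
p_value < 2.2e-16
```

A large F means the variation between state means is large relative to the variation within states, which is what leads to rejecting H0 here.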