Introduction

This project asks whether forest fires occur equally often in every Brazilian state in the dataset. The data set contains 6454 observations. Each case is the number of forest fires reported in one state for a given month and year.

This dataset reports the number of forest fires in Brazil, broken down by state. The series covers a period of approximately 20 years (1998 to 2017). The data were obtained from the official website of the Brazilian government.

Hypothesis

H0- The mean number of forest fires is the same in every state.

H1- The mean number of forest fires differs in at least one state.

This is an observational study. The response variable is the number of forest fires, which is numerical. The explanatory variable is the state, which is categorical.

Conditions to be met.

  1. Each state has a sample size of more than 30.
  2. The variance is roughly constant across states.
  3. The observations are independent of one another.
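The sample-size condition can be checked by counting the observations in each group. The following is a minimal sketch on made-up data; `toy_fires` and its group labels are hypothetical stand-ins for the `state` and `number` columns, not the real dataset.

```r
# Hypothetical toy data standing in for the state and number columns.
toy_fires <- data.frame(
  state  = rep(c("A", "B", "C"), times = c(35, 40, 45)),
  number = seq_len(120)
)

# Count the observations in each group.
group_sizes <- table(toy_fires$state)

# TRUE only if every group has more than 30 observations.
all_large <- all(group_sizes > 30)

print(group_sizes)
print(all_large)
```

Running the same `table()` call on the real `state` column would show whether every state clears the threshold.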

Analysis

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(stringr)

forest_fire<-read.csv("https://raw.githubusercontent.com/Sizzlo/Data-Project-Proposal/master/amazon.csv", sep=",", stringsAsFactors=TRUE)

dim(forest_fire)
## [1] 6454    5
names(forest_fire)
## [1] "year"   "state"  "month"  "number" "date"
summary(forest_fire)
##       year              state            month          number     
##  Min.   :1998   Rio        : 717   Janeiro  : 541   Min.   :  0.0  
##  1st Qu.:2002   Mato Grosso: 478   Abril    : 540   1st Qu.:  3.0  
##  Median :2007   Paraiba    : 478   Agosto   : 540   Median : 24.0  
##  Mean   :2007   Alagoas    : 240   Fevereiro: 540   Mean   :108.3  
##  3rd Qu.:2012   Acre       : 239   Julho    : 540   3rd Qu.:113.0  
##  Max.   :2017   Amapa      : 239   Junho    : 540   Max.   :998.0  
##                 (Other)    :4063   (Other)  :3213                  
##          date     
##  1998-01-01: 324  
##  1999-01-01: 324  
##  2000-01-01: 324  
##  2001-01-01: 324  
##  2002-01-01: 324  
##  2003-01-01: 324  
##  (Other)   :4510

The code below totals the number of forest fires in each state and sorts the totals from highest to lowest. I want to look at the 10 states with the most forest fires, because these are the areas that call for the most vigilance. Mato Grosso has the highest total.

fire1<-forest_fire %>% 
  group_by(state) %>% 
  summarise(Total=round(sum(number))) %>% 
  arrange(desc(Total))
  
fire1_top10<-fire1 %>% 
  slice(1:10)

fire1_top10
## # A tibble: 10 x 2
##    state        Total
##    <fct>        <dbl>
##  1 Mato Grosso  96246
##  2 Paraiba      52436
##  3 Sao Paulo    51121
##  4 Rio          45161
##  5 Bahia        44746
##  6 Piau         37804
##  7 Goias        37696
##  8 Minas Gerais 37475
##  9 Tocantins    33708
## 10 Amazonas     30650
ggplot(fire1_top10, aes(x=state, y=Total))+geom_bar(fill="lightblue", stat="identity")

We can also total the forest fires by year and look at the five years with the most fires. 2003 had the most forest fires.

fire2<-forest_fire %>% 
  group_by(year) %>% 
  summarise(Total=round(sum(number))) %>% 
  arrange(desc(Total)) %>% 
  slice(1:5)
kable(head(fire2))
year    Total
2003    42761
2016    42212
2015    41208
2012    40085
2014    39621
ggplot(fire2, aes(x=year, y=Total))+geom_point(stat="identity")

Tidy the data to keep only the state and the number of forest fires.

test1<-forest_fire %>% 
  group_by(state) %>% 
  select(state, number)

test1
## # A tibble: 6,454 x 2
## # Groups:   state [23]
##    state number
##    <fct>  <dbl>
##  1 Acre       0
##  2 Acre       0
##  3 Acre       0
##  4 Acre       0
##  5 Acre       0
##  6 Acre      10
##  7 Acre       0
##  8 Acre      12
##  9 Acre       4
## 10 Acre       0
## # ... with 6,444 more rows

Compute the mean number of forest fires for each state.

meanfire1<-forest_fire %>% 
  group_by(state) %>% 
  summarise(Total=round(mean(number))) %>% 
  arrange(desc(Total))

meanfire1
## # A tibble: 23 x 2
##    state        Total
##    <fct>        <dbl>
##  1 Sao Paulo      214
##  2 Mato Grosso    201
##  3 Bahia          187
##  4 Goias          158
##  5 Piau           158
##  6 Minas Gerais   157
##  7 Tocantins      141
##  8 Amazonas       128
##  9 Ceara          127
## 10 Paraiba        110
## # ... with 13 more rows

This graph shows the mean number of forest fires for each state.

ggplot(meanfire1, aes(x=state, y=Total))+geom_bar(fill="lightblue", stat="identity")+ labs(y="Average Total") + coord_flip()

Check the constant-variance condition by plotting the number of forest fires for each state.

ggplot(test1, aes(x=number, y=state))+geom_point()
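A plot gives a visual impression of the spread, and a numerical check can back it up. The sketch below is not part of the original analysis: it uses hypothetical toy data (`toy_fires`) and compares the standard deviation of each group, then runs Bartlett's test from base R as one possible formal check. The "largest SD at most about twice the smallest" comparison is only a common rule of thumb, assumed here for illustration.

```r
# Hypothetical toy data: three groups of counts with different means.
set.seed(1)
toy_fires <- data.frame(
  state  = rep(c("A", "B", "C"), each = 40),
  number = c(rpois(40, 20), rpois(40, 100), rpois(40, 200))
)

# Standard deviation of the counts within each group.
spread <- aggregate(number ~ state, data = toy_fires, FUN = sd)
names(spread)[2] <- "sd"

# Ratio of the largest to the smallest group SD (rule of thumb: about 2 or less).
sd_ratio <- max(spread$sd) / min(spread$sd)

# Bartlett's test of equal variances across groups (stats package).
bt <- bartlett.test(number ~ state, data = toy_fires)

print(spread)
print(sd_ratio)
print(bt$p.value)
```

Applied to the real data, `toy_fires` would be replaced with `test1`. A large ratio or a small Bartlett p-value would suggest the constant-variance condition is doubtful, in which case the ANOVA result should be read with caution.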

Fit a linear model of number on state and run the ANOVA.

fire_lm=lm(number~state, data= test1)

anova(fire_lm)
## Analysis of Variance Table
## 
## Response: number
##             Df    Sum Sq Mean Sq F value    Pr(>F)    
## state       22  20105714  913896  27.356 < 2.2e-16 ***
## Residuals 6431 214843574   33407                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
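The F value in the table can be reproduced by hand from the sums of squares and degrees of freedom it reports. The sketch below only re-does that arithmetic; the variable names are illustrative.

```r
# Values copied from the ANOVA table above.
ss_state <- 20105714
df_state <- 22
ss_resid <- 214843574
df_resid <- 6431

# Mean square = sum of squares / degrees of freedom.
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid

# F statistic = between-group mean square / residual mean square.
f_value <- ms_state / ms_resid   # about 27.356, matching the table

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

print(f_value)
print(p_value)
```

The p-value is far below the `2.2e-16` threshold that R prints, which is why the table shows `< 2.2e-16`.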

Conclusion

Since the p-value is less than 0.05, we reject the null hypothesis that the mean number of forest fires is the same in every state. We conclude that at least one state has a different mean number of forest fires.