Data Project Proposal

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Are some Brazilian states more susceptible to forest fires than others?

## Cases

What are the cases, and how many are there?

Each case is the number of forest fires reported in one state in one month of one year. There are 6454 observations in the data set.

## Data collection

Describe the method of data collection.

This dataset reports the number of forest fires in Brazil, broken down by state. The series covers a period of approximately 20 years (1998 to 2017). The data were obtained from the official website of the Brazilian government.

## Type of study

What type of study is this (observational/experiment)?

This is an observational study.

## Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

http://dados.gov.br/dataset/sistema-nacional-de-informacoes-florestais-snif

## Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the number of forest fires, and it is numerical.

## Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is the state, and it is categorical.

## Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)
library(stringr)

forest_fire<-read.csv("https://raw.githubusercontent.com/Sizzlo/Data-Project-Proposal/master/amazon.csv", sep=",")

dim(forest_fire)

## [1] 6454    5

names(forest_fire)

## [1] "year"   "state"  "month"  "number" "date"

summary(forest_fire)

##       year              state            month          number     
##  Min.   :1998   Rio        : 717   Janeiro  : 541   Min.   :  0.0  
##  1st Qu.:2002   Mato Grosso: 478   Abril    : 540   1st Qu.:  3.0  
##  Median :2007   Paraiba    : 478   Agosto   : 540   Median : 24.0  
##  Mean   :2007   Alagoas    : 240   Fevereiro: 540   Mean   :108.3  
##  3rd Qu.:2012   Acre       : 239   Julho    : 540   3rd Qu.:113.0  
##  Max.   :2017   Amapa      : 239   Junho    : 540   Max.   :998.0  
##                 (Other)    :4063   (Other)  :3213                  
##          date     
##  1998-01-01: 324  
##  1999-01-01: 324  
##  2000-01-01: 324  
##  2001-01-01: 324  
##  2002-01-01: 324  
##  2003-01-01: 324  
##  (Other)   :4510

The code below totals the forest fires for each state and sorts the states from the highest total to the lowest. Keeping the top 10 shows which states report the most fires, which can help direct attention to the areas most prone to them. Mato Grosso has the highest total.

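# Total fires per state, sorted from highest to lowest
fire1<-forest_fire %>% 
  group_by(state) %>% 
  summarise(Total=round(sum(number))) %>% 
  arrange(desc(Total))

# Keep the 10 states with the highest totals (row indices start at 1)
fire1_top10<-fire1 %>% 
  slice(1:10)

# reorder() keeps the bars in descending order of Total instead of alphabetical order
ggplot(fire1_top10, aes(x=reorder(state, -Total), y=Total))+geom_col()+labs(x="state")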
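The `summary()` output above shows that states do not all have the same number of rows (717 for Rio versus 239 for Acre, for example), so totals alone may not be directly comparable. A minimal sketch of the per-group means, SDs, and sample sizes that the template asks for is given here. The data frame `toy` and its values are hypothetical stand-ins that only reuse the `state` and `number` column names; the summary column names are also illustrative.

```r
library(dplyr)

# Hypothetical toy data: same column names as forest_fire, made-up values
toy <- data.frame(
  state  = c("Acre", "Acre", "Acre", "Rio", "Rio", "Rio", "Rio", "Rio", "Rio"),
  number = c(10, 20, 30, 10, 10, 10, 15, 15, 15)
)

# Sample size, mean, SD, and total of the fire counts for each state
state_stats <- toy %>%
  group_by(state) %>%
  summarise(
    n_obs       = n(),
    mean_fires  = mean(number),
    sd_fires    = sd(number),
    total_fires = sum(number)
  ) %>%
  arrange(desc(mean_fires))

print(state_stats)
```

In this toy example the state with more rows has the larger total but the smaller mean, which is why the mean and sample size are worth reporting alongside the total. Replacing `toy` with `forest_fire` should give the same summary for the real data.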

We can also group the data by year to see which years had the most forest fires. 2003 has the highest total, with 42761 fires. The table below lists the five years with the highest totals.

# Total fires per year, keeping the 5 years with the highest totals
fire2<-forest_fire %>% 
  group_by(year) %>% 
  summarise(Total=round(sum(number))) %>% 
  arrange(desc(Total)) %>% 
  slice(1:5)
kable(fire2)

| year| Total|
|----:|-----:|
| 2003| 42761|
| 2016| 42212|
| 2015| 41208|
| 2012| 40085|
| 2014| 39621|

# Scatter plot of the five highest yearly totals
ggplot(fire2, aes(x=year, y=Total))+geom_point()
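A top-five view shows only the peak years. As a possible complement, the sketch below keeps every year and sorts the totals chronologically so that any trend over time is visible. The data frame `toy` and its values are hypothetical and only reuse the `year` and `number` column names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: same column names as forest_fire, made-up values
toy <- data.frame(
  year   = c(1998, 1998, 1999, 1999, 2000, 2000),
  state  = c("Acre", "Alagoas", "Acre", "Alagoas", "Acre", "Alagoas"),
  number = c(10, 5, 20, 15, 8, 12)
)

# Total fires per year, in chronological order rather than sorted by total
yearly <- toy %>%
  group_by(year) %>%
  summarise(Total = sum(number)) %>%
  arrange(year)

# Line plus points: one point per year, connected to show the trend
p <- ggplot(yearly, aes(x = year, y = Total)) +
  geom_line() +
  geom_point()
print(p)
```

Replacing `toy` with `forest_fire` should give one point per year from 1998 to 2017.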


Tony Mei

10/12/2019
