title: "STATISTICAL ANALYSIS OF THE AMAZON FOREST FIRES IN BRAZIL (1998 - 2017)"

author: Nikhil Vijay Rodrigues (S3699610), Krishna Rajendra Attal (S3670227)

date: 26th October 2019

output: slidy_presentation

Introduction:

Forest fires in Brazil pose a serious threat to the tropical forests of the Amazon, the largest rainforest region on the planet. The pattern of these fires differs from that of forest fires elsewhere in the world, and their frequency varies from state to state. The data cover the years 1998 to 2017, a span of two decades, and were obtained from the open data platform Kaggle as a '.csv' file organised by state, year and month. Because the file is large, the analysis was coded in R using RStudio.

Brazil's forest cover lies mainly in the states of the tropical zone and in some adjoining states of the temperate zone. Of the 23 states in the data, only 9 are flagged as 'Legal Amazon' in the analysis code, and these account for most of the forest cover; one of them contributes a particularly large share of the fires. The analysis nonetheless covers all 23 states for which forest fires are reported, officially or unofficially. Line plots, bar plots, histograms, box plots and dot plots provide the basic descriptive analysis, while t-tests and chi-square hypothesis tests provide the more detailed analysis.

Problem Statement:

The aim of the study is to analyse the forest fires that occurred in Brazil, including the Amazon rainforest region, by state, month and year over the period from 1998 to 2017. The study addresses the following questions: which state is most affected by forest fires, which years record the highest and lowest counts, which months account for most of the fires, and what the least and maximum numbers of reported incidents are. These questions are examined with line plots, bar plots, histograms (month-wise and state-wise), box plots (month-wise versus state-wise comparison) and dot plots. Because the descriptive plots alone do not show whether the observed variations are statistically significant, the study also applies t-tests and chi-square hypothesis tests to examine the variations in more detail.

Data:

The data for the study were obtained from Kaggle: https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil. Additional information was taken from the Brazilian government's open data portal: http://dados.gov.br/dataset/sistema-nacional-de-informacoes-florestais-snif. The amazon.csv file itself is available at https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil/downloads/amazon.csv/data. The file was saved in a local folder so that it could be read during the analysis, and it was also loaded under the name 'fires' so that a duplicate is available and unwanted changes do not affect the original data.

The file consists of 6,455 rows and 7 columns; the working data set holds 6,454 observations of 6 variables: 'year', 'state', 'month', 'number', 'date' and 'amazonLegal'. The states are Acre, Alagoas, Amapa, Amazonas, Bahia, Ceara, Distrito Federal, Espirito Santo, Goias, Maranhao, Mato Grosso, Minas Gerais, Para, Paraiba, Pernambuco, Piau, Rio, Rondonia, Roraima, Santa Catarina, Sao Paulo, Sergipe and Tocantins (23 states). Months range from January to December (12 months) and years from 1998 to 2017 (a span of two decades, 20 years). The 'number' variable gives the count of forest fires reported in that particular month, ranging from 0 to 100. The 'date' variable gives the date on which the record was made, officially or unofficially. The 'amazonLegal' variable is created in the analysis code and reads '1' if the state belongs to the Legal Amazon and '0' otherwise.

In addition to the variables in amazon.csv, about twelve user-defined objects were created in R. The objects amazon, amazon_data, fire, fire_amazon, fire_amazon2 and fire_amazon3 hold subsets of the data; the 'fire_amazon' objects hold the yearly fire totals for the Legal Amazon states with and without Mato Grosso. The objects fires, fires_years, fires2, fires3 and fires4 were used for the plots of forest fires by state, month and year. 'newdatasum1' held the data for the summary analysis and 'normal_fires' the data for the normalised distribution. No constants were defined, since all values vary over the two decades taken as the sample period. The statistical analysis uses the variables of amazon.csv together with these user-defined objects.
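As a quick check of the structure described above, the dimensions and ranges of a data frame can be inspected directly. The following is a minimal sketch in base R on a hypothetical four-row stand-in for amazon.csv; the values and the object name `toy` are invented for illustration and are not taken from the real file.

```r
# Hypothetical four-row stand-in for amazon.csv (values invented for illustration)
csv_text <- "year,state,month,number,date
1998,Acre,Janeiro,0,1998-01-01
1998,Acre,Fevereiro,3,1998-01-01
2017,Tocantins,Novembro,55,2017-01-01
2017,Tocantins,Dezembro,12,2017-01-01"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(toy)                                 # number of rows and columns
str(toy)                                 # variable names and types
range(toy$year)                          # first and last year covered
n_years  <- diff(range(toy$year)) + 1    # calendar years spanned
n_states <- length(unique(toy$state))    # distinct states
c(years = n_years, states = n_states)
```

Running the same calls on the real data would show directly how many rows, columns, states and years the file contains.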

Descriptive Statistics and Visualisation:

The following R code was used for the analysis:

Importing library packages

library(tidyverse)

library(extrafont)

library(lubridate)

library(data.table)

library(scales)

Reading the file

fires <- read_csv("F:/RMIT/Year 2 2019/Semester 4/Introduction to Statistics/Assignments/Assignment 3/amazon.csv")

view(fires)

str(fires)

Re-naming months in English from Portuguese

# Any month not matched below (March) falls through to the final value, spelled in full so that it matches the factor levels used in the month-wise bar plot
fires2 <- fires %>%
  mutate(month = if_else(month == "Abril", "April",
                 if_else(month == "Agosto", "August",
                 if_else(month == "Dezembro", "December",
                 if_else(month == "Fevereiro", "February",
                 if_else(month == "Janeiro", "January",
                 if_else(month == "Julho", "July",
                 if_else(month == "Junho", "June",
                 if_else(month == "Maio", "May",
                 if_else(month == "Setembro", "September",
                 if_else(month == "Novembro", "November",
                 if_else(month == "Outubro", "October", "March"))))))))))))
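The renaming step can also be expressed with a named lookup vector, which may be easier to read than nested conditions. This is a sketch of an alternative, not the code used in the analysis; `month_lookup` and `translate_month` are hypothetical names, and the fallback to March for unmatched values is an assumption that mirrors the final default of the nested `if_else`.

```r
# Named lookup vector: Portuguese month name -> English month name
month_lookup <- c(
  Janeiro = "January", Fevereiro = "February", Abril = "April",
  Maio = "May", Junho = "June", Julho = "July", Agosto = "August",
  Setembro = "September", Outubro = "October", Novembro = "November",
  Dezembro = "December"
)

translate_month <- function(x) {
  out <- unname(month_lookup[x])   # NA for names missing from the lookup
  out[is.na(out)] <- "March"       # assumed fallback for the one unlisted month
  out
}

translated <- translate_month(c("Janeiro", "Abril", "Mar\u00e7o"))
translated
```

Adding or correcting a month then only requires changing one entry of the lookup vector.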

Forest fires analysis by month

table(fires2$month)

Forest fires analysis by year

table(fires2$year)

Forest fires analysis by state

table(fires2$state)

Histogram plot of frequency v/s number of fires

# fires3 and fires4 are not defined in the code shown; they are assumed here to be copies of fires2

fires3 <- fires2

fires4 <- fires2

fires4 %>%
  ggplot(aes(number)) +
  geom_histogram(fill = "red") +
  geom_vline(xintercept = median(fires3$number), color = "blue") +
  geom_vline(xintercept = mean(fires3$number), color = "green") +
  xlab("Number of Fires") +
  ylab("Frequency") +
  ggtitle("Forest fire distribution from 1998 to 2017") +
  theme(title = element_text(family = "Luminari"))

Calculation of mean and median

paste0("Median: ", median(fires3$number), " fires per month")

paste0("Mean: ", round(mean(fires3$number), 2), " fires per month")
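When count data contain a few very large values, the mean is pulled well above the median, which is why both are reported. The sketch below illustrates this on hypothetical monthly counts; the vector `toy_counts` is invented and does not come from the data set.

```r
# Hypothetical monthly fire counts with a long right tail (invented values)
toy_counts <- c(0, 1, 2, 5, 10, 24, 30, 60, 200, 750)

toy_median <- median(toy_counts)   # middle of the sorted values
toy_mean   <- mean(toy_counts)     # pulled upwards by the two large counts

paste0("Median: ", toy_median, " fires per month")
paste0("Mean: ", round(toy_mean, 2), " fires per month")
```

In this toy vector the median is 17 and the mean is 108.2, so most months lie far below the mean.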

Line-plot of number of forest fires by year

fires4 %>% group_by(state) %>% summarise(median_fires = median(number), mean_fires = mean(number)) %>% arrange(desc(median_fires))

fires4 %>%
  group_by(year) %>%
  summarise(fire_count = sum(number)) %>%
  ggplot(aes(x = year, y = fire_count)) +
  geom_point() +
  geom_line() +
  geom_text(aes(label = fire_count), color = "red", vjust = 1.5) +
  xlab("Year") +
  ylab("Number of Fires") +
  ggtitle("Number of Forest Fires by Year") +
  theme(title = element_text(family = "Luminari"))

Line-plots of number of forest fires v/s year by state

fires4 %>%
  group_by(year, state) %>%
  summarise(fire_count = sum(number)) %>%
  ggplot(aes(x = year, y = fire_count)) +
  geom_line() +
  xlab("Year") +
  ylab("Number of Fires") +
  ggtitle("Number of Forest Fires by Year", subtitle = "By state") +
  theme(title = element_text(family = "Luminari")) +
  facet_wrap(~state) +
  theme(axis.text.y = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.ticks.y = element_blank())

Bar-plot of number of forest fires versus month

# scale_fill_identity() makes ggplot use the literal colour names returned by if_else()
fires4 %>%
  group_by(month) %>%
  summarise(fires = sum(number)) %>%
  mutate(month = factor(month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))) %>%
  ggplot(aes(month, fires, fill = if_else(month %in% c("July", "August", "September", "October", "November"), "green", "red"))) +
  geom_bar(stat = "identity") +
  scale_fill_identity() +
  theme(legend.position = "none") +
  xlab("Month") +
  ylab("Number of Fires") +
  ggtitle("Number of overall forest fires month-wise from 1998 to 2017") +
  theme(title = element_text(family = "Luminari"))
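The conversion of month to a factor in the bar plot controls the order of the bars: character values sort alphabetically, whereas a factor sorts by its levels. A small sketch on invented data follows; it uses R's built-in `month.name` constant for the calendar order, and `toy` is a hypothetical data frame.

```r
# Hypothetical monthly totals (invented values), deliberately out of order
toy <- data.frame(
  month = c("March", "January", "February"),
  fires = c(30, 10, 20),
  stringsAsFactors = FALSE
)

alphabetical <- sort(toy$month)    # character sort: February, January, March

# A factor with calendar-ordered levels sorts, and plots, in calendar order
toy$month <- factor(toy$month, levels = month.name)
calendar <- toy[order(toy$month), ]
calendar
```

Any month whose spelling does not match one of the levels becomes NA at this step, so the labels must agree exactly with the level names.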

Least versus maximum number of reported incidents analysis

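# amazon is not defined in the code shown; it is assumed here to be the data as read from amazon.csv

amazon <- fires

str(amazon)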

dim(amazon)

cat("Least number of incidents reported:", min(amazon$number))

cat(" & Max number of incidents reported:",max(amazon$number))

Year wise total number of forest fires count

amazon %>% group_by(year) %>% summarise(number = sum(number))
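The year-wise totals can also be used to read off the years with the lowest and highest counts. The following is a minimal base R sketch on invented numbers; `toy` and `yearly` are hypothetical objects, and `aggregate()` is used here in place of `group_by()` and `summarise()` only to keep the example free of package dependencies.

```r
# Hypothetical records: two rows per year (invented values)
toy <- data.frame(
  year   = c(1998, 1998, 1999, 1999, 2000, 2000),
  number = c(5, 10, 40, 2, 7, 8)
)

# Sum the fire counts within each year
yearly <- aggregate(number ~ year, data = toy, FUN = sum)
yearly

# Years with the lowest and highest totals (first match in case of ties)
min_year <- yearly$year[which.min(yearly$number)]
max_year <- yearly$year[which.max(yearly$number)]
c(lowest = min_year, highest = max_year)
```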

Forest fires analysis by year in all Legal Amazon regions

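# 'Par?' in the source is read as a garbled 'Pará'; the spelling must match the state column exactly as read from the file
amazon$amazonLegal <- ifelse(amazon$state %in% c('Acre', 'Amapa', 'Amazonas', 'Maranhao', 'Mato Grosso', 'Pará', 'Rondonia', 'Roraima', 'Tocantins'), 1, 0)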

fire_amazon <- amazon %>% filter(amazonLegal == 1) %>% group_by(year) %>% summarise(total = sum(number))

ggplot(data = fire_amazon, aes(x = year, y = total)) +
  geom_line(color = 'tomato2', size = 1.5) +
  geom_smooth() +
  labs(x = 'year', y = 'Forest Fires in Legal Amazon', title = 'Forest Fires in Legal Amazon by Year')

Forest fires analysis by year in Legal Amazon regions without Mato Grosso

fire_amazon2 <- amazon %>% filter(amazonLegal == 1 & !(state == 'Mato Grosso')) %>% group_by(year) %>% summarise(total = sum(number))

ggplot(data = fire_amazon2, aes(x = year, y = total)) +
  geom_line(color = 'tomato2', size = 1.5) +
  geom_smooth() +
  labs(x = 'year', y = 'Forest Fires in Legal Amazon without Mato Grosso', title = 'Forest Fires in Legal Amazon without Mato Grosso')
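The effect of one state on the yearly totals can be checked by placing the totals with and without that state side by side. The sketch below does this in base R on invented data; `toy`, `legal_states`, `with_mt`, `without_mt` and `cmp` are hypothetical names, and the state spellings in `legal_states` are assumptions that would have to match the real state column.

```r
# Assumed spellings of the Legal Amazon states (must match the data in practice)
legal_states <- c("Acre", "Amapa", "Amazonas", "Maranhao", "Mato Grosso",
                  "Para", "Rondonia", "Roraima", "Tocantins")

# Hypothetical records (invented values)
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2001, 2001, 2001),
  state  = c("Acre", "Mato Grosso", "Sao Paulo", "Acre", "Mato Grosso", "Sao Paulo"),
  number = c(10, 90, 30, 20, 80, 40),
  stringsAsFactors = FALSE
)

# Flag Legal Amazon states and keep only those rows
toy$amazonLegal <- ifelse(toy$state %in% legal_states, 1, 0)
legal <- toy[toy$amazonLegal == 1, ]

# Yearly totals with and without Mato Grosso
with_mt    <- aggregate(number ~ year, data = legal, FUN = sum)
without_mt <- aggregate(number ~ year, data = legal[legal$state != "Mato Grosso", ], FUN = sum)

cmp <- merge(with_mt, without_mt, by = "year", suffixes = c("_all", "_no_mt"))
cmp
```

The difference between the two total columns is the contribution of Mato Grosso in each year.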

Summarisation of information at a glance

summary(amazon)

Sample t-tests and hypothesis tests

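One-sample t-test

t.test(amazon$number, mu = 44.01, alternative = "two.sided")

Two-sample t-test

# The grouping variable is written as 'integer' in the source; amazonLegal (0/1) is assumed here

t.test(number ~ amazonLegal, data = amazon, var.equal = FALSE, alternative = "two.sided")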
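To make the t-tests concrete, the one-sample statistic can be computed by hand and compared with the output of `t.test()`. This is a sketch on invented numbers: `x`, `toy2` and the grouping labels are hypothetical, and only the hypothesised mean of 44.01 is taken from the analysis above.

```r
# Hypothetical sample of monthly fire counts (invented values)
x   <- c(12, 30, 45, 50, 61, 38, 47, 55)
mu0 <- 44.01                      # hypothesised mean, as in the analysis
n   <- length(x)

# One-sample t statistic and two-sided p-value computed by hand
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The same test with t.test()
res <- t.test(x, mu = mu0, alternative = "two.sided")
c(manual = t_manual, t_test = unname(res$statistic))
c(manual = p_manual, t_test = res$p.value)

# Welch two-sample t-test on a hypothetical two-group data frame
toy2  <- data.frame(number = c(x, x + 5), group = rep(c("a", "b"), each = n))
welch <- t.test(number ~ group, data = toy2, var.equal = FALSE)
welch$estimate                    # the two group means
```

If the p-value is above 0.05, the null hypothesis is not rejected at the 5% level, which is the decision rule applied in the discussion.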

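Chi-square goodness of fit test

table(amazon$year) %>% prop.table()

# pop_prop is not defined in the code shown; equal proportions across the years are assumed here

pop_prop <- rep(1 / n_distinct(amazon$year), n_distinct(amazon$year))

chi1 <- chisq.test(table(amazon$year), p = pop_prop)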

chi1

chi1$observed

chi1$expected
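The goodness of fit test compares observed counts with the counts expected under the hypothesised proportions. The sketch below shows the calculation on invented counts for three years; `observed` and the equal proportions in `pop_prop` are assumptions for illustration and not values from the analysis.

```r
# Hypothetical counts for three years (invented values)
observed <- c(`2015` = 48, `2016` = 52, `2017` = 50)

# Assumed null hypothesis: equal proportions in every year
pop_prop <- rep(1 / length(observed), length(observed))

gof <- chisq.test(observed, p = pop_prop)
gof$expected                      # expected count per year under H0

# The statistic is the sum of (observed - expected)^2 / expected
stat_manual <- sum((observed - gof$expected)^2 / gof$expected)
c(manual = stat_manual, chisq_test = unname(gof$statistic))
gof$p.value
```

Here the observed counts are close to the expected ones, so the p-value is large and the null hypothesis of equal proportions is not rejected.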

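Chi-square test of association

# The second variable is written as amazon$fires4 in the source, which does not exist; year is assumed here, following the discussion of month-wise and year-wise frequencies

table(amazon$month, amazon$year)

chi2 <- chisq.test(table(amazon$month, amazon$year))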

chi2

chi2$observed

chi2$expected
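For the test of association, the expected count in each cell is the row total multiplied by the column total, divided by the grand total. The following sketch reproduces this on an invented 2 x 2 table; the labels 'season' and 'region' and all counts are hypothetical.

```r
# Hypothetical 2 x 2 contingency table (invented counts)
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(season = c("dry", "wet"),
                              region = c("legal", "other")))

# Pearson chi-square test without continuity correction
assoc <- chisq.test(tab, correct = FALSE)

# Expected counts under independence: row total * column total / grand total
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected_manual
assoc$expected

assoc$statistic
assoc$p.value
```

A small p-value leads to rejecting the null hypothesis that the two variables are independent.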

Discussion

From Fig. 1, we can observe that the frequency peaks in the thousands at low fire counts and then declines gradually as the number of fires increases. From Fig. 2, we can observe that, with the state of Mato Grosso included, the yearly total of forest fires reaches about 15,000. The total rises sharply from 1998 to 2003, then increases slightly or remains roughly constant until 2015, and falls after 2015. The maximum is observed in 2013. From Fig. 3, we can observe that, with Mato Grosso excluded, the yearly total reaches only about 9,000. The year-wise trend is the same as with Mato Grosso included; only the number of fires changes.

From the line plot in Fig. 4, we observe that the least number of forest fires took place in 1998, with 20,014 in total, while the maximum occurred in 2003, with about 42,756. After this steep rise, the totals follow a zig-zag pattern of rise and fall until the end of 2017. The line plots in Fig. 5 indicate that the state of Rondonia is significantly affected. The state of Sao Paulo is also considerably affected, while the state of Sergipe is affected less than other states. The line plots for the other states are not discussed because those states do not fall under the Legal Amazon regions. From the bar plot in Fig. 6, we can deduce that the months from July to November account for a significant number of forest fires, with the maximum in July and the minimum in September, whereas the remaining months contribute comparatively little. From the box plot in Fig. 7, we can notice that the state-wise impact of forest fires is less than the month-wise impact. In Fig. 8, the line plot increases linearly, which reflects the low median and high mean.

For the one-sample t-test, the p-value is greater than the 0.05 level of statistical significance and the 95% CI captures the hypothesised mean, so the result is not statistically significant and we fail to reject H0. For the two-sample t-test, the p-value is also greater than 0.05 and the 95% CI captures the hypothesised value, so the result is not statistically significant and we fail to reject H0. For the chi-square goodness of fit test, the p-value is greater than 0.05, hence we fail to reject H0: the frequency of forest fires is nearly the same for the 3 years. Finally, for the chi-square test of association, the p-value is very small (2.2e-16), hence we reject H0. The hypothesis of no relationship between the month-wise and year-wise frequency of forest fires is therefore not supported, irrespective of whether or not the incidents were included in the legal count.

Conclusion

We have analysed the forest fires in Brazil over the two decades from 1998 to 2017, using the state-wise fire counts for every month from January to December and a range of statistical techniques implemented in R with RStudio. We first used different functions to filter and prepare the required information. Next, we used line plots, bar plots, box plots and histograms to explore the information. Further, we conducted t-tests and chi-square hypothesis tests to evaluate the data in more depth. Finally, we conclude that forest fires in the Legal Amazon areas had a considerable impact from 2010 to 2017, following a steep rise from 1998 to 2003 and a roughly constant level from 2003 to 2010. The state of Mato Grosso was heavily affected, while Sao Paulo was not significantly affected. The states for which no data are shown are not part of the legally defined Amazon region of the tropical rainforest, but they may account for forest fires that went unrecorded. Hence, their data were incorporated into the in-depth analysis.

References

[1] https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil

[2] http://dados.gov.br/dataset/sistema-nacional-de-informacoes-florestais-snif

[3] MATH 1324 | Introduction to Statistics | Lecture slides | RMIT University

[4] https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil/downloads/amazon.csv/data

[5] RStudio programming tutorials | RMIT University

[6] MATH 1324 | Introduction to Statistics | Class Worksheets | RMIT University

[7] MATH 1324 | Introduction to Statistics | Class Modules | RMIT University