1 Background

Kebakaran hutan merupakan salah satu isu penting di dunia. hampir disetiap daerah di dunia terdapat titik kebakaran hutan, mulai dari Amerika, Australia, hingga Eropa. isu kebakaran hutan sangat mempengaruhi setiap aspek kehidupan makhluk hidup. Banyak yang menjadi penyebab kebakaran hutan seperti kelalaian manusia, kenaikan suhu dunia. meskipun sudah banyak negara yang mengeluarkan dana yang tidak sedikit untuk menanggulangi bencana ini, tetapi tetap saja terjadi bencana ini. Sebagai contoh di Portugal pada tahun 1980 sampai 2005 lebih dari 2.7 juta ha sudah habis terbakar. Terdapat alat bantu untuk memberikan rating bahaya pada kebakaran hutan dengan menggunakan Canadian System yaitu Fire Weather Index (FWI).

Pada project kali ini, saya mencoba untuk membuat model terbaik untuk memprediksikan kebakaran hutan. saya menggunakan data kebakaran hutan didapatkan dari Cortez dan Morais (2007). data yang diambil berkisar dari bulan January 2000 hingga Desember 2003. Lokasi tempat pengambilan data berasal dari Montensinho National Park. Dapat dilihat data tersebut berisi variabel:

Memuat library

library(readr)
library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(MASS)
library(MLmetrics)

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

library(tidyverse)

## -- Attaching packages --------------------------- tidyverse 1.3.0 --

## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v purrr   0.3.3     v forcats 0.4.0

## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()

library(tidymodels)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## -- Attaching packages -------------------------- tidymodels 0.0.3 --

## v broom     0.5.3     v recipes   0.1.9
## v dials     0.0.4     v rsample   0.0.5
## v infer     0.5.1     v yardstick 0.0.4
## v parsnip   0.0.5

## -- Conflicts ----------------------------- tidymodels_conflicts() --
## x scales::discard()   masks purrr::discard()
## x dplyr::filter()     masks stats::filter()
## x recipes::fixed()    masks stringr::fixed()
## x dplyr::lag()        masks stats::lag()
## x dials::margin()     masks ggplot2::margin()
## x dplyr::select()     masks MASS::select()
## x yardstick::spec()   masks readr::spec()
## x recipes::step()     masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()

library(tidyr)
library(funModeling)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:parsnip':
## 
##     translate

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai

## 
## Attaching package: 'funModeling'

## The following object is masked from 'package:GGally':
## 
##     range01

library(ranger)
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:dials':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:Hmisc':
## 
##     subplot

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(rpart)

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:dials':
## 
##     prune

library(recipes)
library(png)
library(ggpubr)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

Memuat dataset

forest <- read_csv("forestfires.csv")

## Parsed with column specification:
## cols(
##   X = col_double(),
##   Y = col_double(),
##   month = col_character(),
##   day = col_character(),
##   FFMC = col_double(),
##   DMC = col_double(),
##   DC = col_double(),
##   ISI = col_double(),
##   temp = col_double(),
##   RH = col_double(),
##   wind = col_double(),
##   rain = col_double(),
##   area = col_double()
## )

str(forest)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of  13 variables:
##  $ X    : num  7 7 7 8 8 8 8 8 8 7 ...
##  $ Y    : num  5 4 4 6 6 6 6 6 6 5 ...
##  $ month: chr  "mar" "oct" "oct" "mar" ...
##  $ day  : chr  "fri" "tue" "sat" "fri" ...
##  $ FFMC : num  86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
##  $ DMC  : num  26.2 35.4 43.7 33.3 51.3 ...
##  $ DC   : num  94.3 669.1 686.9 77.5 102.2 ...
##  $ ISI  : num  5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
##  $ temp : num  8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
##  $ RH   : num  51 33 33 97 99 29 27 86 63 40 ...
##  $ wind : num  6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
##  $ rain : num  0 0 0 0.2 0 0 0 0 0 0 ...
##  $ area : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X = col_double(),
##   ..   Y = col_double(),
##   ..   month = col_character(),
##   ..   day = col_character(),
##   ..   FFMC = col_double(),
##   ..   DMC = col_double(),
##   ..   DC = col_double(),
##   ..   ISI = col_double(),
##   ..   temp = col_double(),
##   ..   RH = col_double(),
##   ..   wind = col_double(),
##   ..   rain = col_double(),
##   ..   area = col_double()
##   .. )

pada data tersebut terdapat 517 observasi dan 13 variabel. penjelasan untuk masing-masing variabel adalah :

X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: ‘jan’ to ‘dec’
day - day of the week: ‘mon’ to ‘sun’
FFMC - FFMC index dari FWI system: 18.7 to 96.20
DMC - DMC index dari FWI system: 1.1 to 291.3
DC - DC index dari FWI system: 7.9 to 860.6
ISI - ISI index dari FWI system: 0.0 to 56.10
temp - suhu in Celsius degrees: 2.2 to 33.30
RH - kelembaban relatif in %: 15.0 to 100
wind - kecepatan angin in km/h: 0.40 to 9.40
rain - curah hujan in mm/m2 : 0.0 to 6.4
area - area yang terbakar (in ha): 0.00 to 1090.84

2 Exploratory Data Analysis

Sebelum memulai Exploratory data kita dapat mengecek nilai NA pada setiap variable

colSums(is.na(forest))

##     X     Y month   day  FFMC   DMC    DC   ISI  temp    RH  wind  rain  area 
##     0     0     0     0     0     0     0     0     0     0     0     0     0

Kesimpulan dari pengecekan ini tidak terdapat missing value pada data

Untuk melihat gambaran dari datanya kita dapat menggunakan fungsi head pada data

head(forest,1)

## # A tibble: 1 x 13
##       X     Y month day    FFMC   DMC    DC   ISI  temp    RH  wind  rain  area
##   <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     7     5 mar   fri    86.2  26.2  94.3   5.1   8.2    51   6.7     0     0

Memperbaiki nama bulan menjadi kapital dan sesuai ejaan yang disempurnakan serta mengatur level bulan agar sesuai dengan yang diinginkan

forest$month <- sapply(as.character(forest$month), switch, 
                     "jan" = "January",
                     "feb" = "February",
                     "mar" = "March",
                     "apr" = "April",
                     "may" = "May",
                     "jun" = "June",
                     "jul" = "July",
                     "aug" = "August",
                     "sep" = "September",
                     "oct" = "October",
                     "nov" = "November",
                     "dec" = "December")
forest$month <- as.factor(forest$month)
str(forest)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of  13 variables:
##  $ X    : num  7 7 7 8 8 8 8 8 8 7 ...
##  $ Y    : num  5 4 4 6 6 6 6 6 6 5 ...
##  $ month: Factor w/ 12 levels "April","August",..: 8 11 11 8 8 2 2 2 12 12 ...
##  $ day  : chr  "fri" "tue" "sat" "fri" ...
##  $ FFMC : num  86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
##  $ DMC  : num  26.2 35.4 43.7 33.3 51.3 ...
##  $ DC   : num  94.3 669.1 686.9 77.5 102.2 ...
##  $ ISI  : num  5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
##  $ temp : num  8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
##  $ RH   : num  51 33 33 97 99 29 27 86 63 40 ...
##  $ wind : num  6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
##  $ rain : num  0 0 0 0.2 0 0 0 0 0 0 ...
##  $ area : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X = col_double(),
##   ..   Y = col_double(),
##   ..   month = col_character(),
##   ..   day = col_character(),
##   ..   FFMC = col_double(),
##   ..   DMC = col_double(),
##   ..   DC = col_double(),
##   ..   ISI = col_double(),
##   ..   temp = col_double(),
##   ..   RH = col_double(),
##   ..   wind = col_double(),
##   ..   rain = col_double(),
##   ..   area = col_double()
##   .. )

forest$month <- factor(forest$month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
unique(forest$month)

##  [1] March     October   August    September April     June      July     
##  [8] February  January   December  May       November 
## 12 Levels: January February March April May June July August ... December

Memperbaiki nama hari menjadi kapital dan sesuai ejaan yang disempurnakan serta mengatur level bulan agar sesuai dengan yang diinginkan

forest$day <- sapply(as.character(forest$day), switch, 
                     "sun" = "Sunday",
                     "mon" = "Monday",
                     "tue" = "Tuesday",
                     "wed" = "Wednesday",
                     "thu" = "Thursday",
                     "fri" = "Friday",
                     "sat" = "Saturday")
forest$day <- as.factor(forest$day)
str(forest)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of  13 variables:
##  $ X    : num  7 7 7 8 8 8 8 8 8 7 ...
##  $ Y    : num  5 4 4 6 6 6 6 6 6 5 ...
##  $ month: Factor w/ 12 levels "January","February",..: 3 10 10 3 3 8 8 8 9 9 ...
##  $ day  : Factor w/ 7 levels "Friday","Monday",..: 1 6 3 1 4 4 2 2 6 3 ...
##  $ FFMC : num  86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
##  $ DMC  : num  26.2 35.4 43.7 33.3 51.3 ...
##  $ DC   : num  94.3 669.1 686.9 77.5 102.2 ...
##  $ ISI  : num  5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
##  $ temp : num  8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
##  $ RH   : num  51 33 33 97 99 29 27 86 63 40 ...
##  $ wind : num  6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
##  $ rain : num  0 0 0 0.2 0 0 0 0 0 0 ...
##  $ area : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X = col_double(),
##   ..   Y = col_double(),
##   ..   month = col_character(),
##   ..   day = col_character(),
##   ..   FFMC = col_double(),
##   ..   DMC = col_double(),
##   ..   DC = col_double(),
##   ..   ISI = col_double(),
##   ..   temp = col_double(),
##   ..   RH = col_double(),
##   ..   wind = col_double(),
##   ..   rain = col_double(),
##   ..   area = col_double()
##   .. )

forest$day <- factor(forest$day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
unique(forest$day)

## [1] Friday    Tuesday   Saturday  Sunday    Monday    Wednesday Thursday 
## Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday

Mengecek hari dan bulan sudah terkoreksi dengan baik

head(forest,3)

## # A tibble: 3 x 13
##       X     Y month  day    FFMC   DMC    DC   ISI  temp    RH  wind  rain  area
##   <dbl> <dbl> <fct>  <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     7     5 March  Frid~  86.2  26.2  94.3   5.1   8.2    51   6.7     0     0
## 2     7     4 Octob~ Tues~  90.6  35.4 669.    6.7  18      33   0.9     0     0
## 3     7     4 Octob~ Satu~  90.6  43.7 687.    6.7  14.6    33   1.3     0     0

Mengecek variabel yang bisa ditoleransi untuk menjadi prediktor

ggcorr(forest, label = T, hjust = 1)

## Warning in ggcorr(forest, label = T, hjust = 1): data in column(s) 'month',
## 'day' are not numeric and were ignored

Summary: Terdapat hubungan yang kuat antar prediktor DMC dan DC dengan nilai dari ggcorr sebesar 0.7 (mendekati 1) dan RH dengan temp sebesar -0.5 (mendekati -1)

perlu mengecek outlier yang ada pada variable area menggunakan histogram

hist(forest$area, breaks = 50)

Mengecek outlier setiap variable

plot_num(forest)

mengecek variable bulan yang terdapat didata

forest_m <- forest %>%
    group_by(month) %>% 
    summarise(Quantity=(n()), 
              Total_Area= (sum(area))) %>% 
    arrange(month)

forest_m

## # A tibble: 12 x 3
##    month     Quantity Total_Area
##    <fct>        <int>      <dbl>
##  1 January          2        0  
##  2 February        20      126. 
##  3 March           54      235. 
##  4 April            9       80.0
##  5 May              2       38.5
##  6 June            17       99.3
##  7 July            32      460. 
##  8 August         184     2298. 
##  9 September      172     3086. 
## 10 October         15       99.6
## 11 November         1        0  
## 12 December         9      120.

Menghilangkan data yang tidak penting seperti bulan yang terjadi kebakarannya kecil (January, May, dan November) serta area yang terbakar > 100 Ha. Penghapusan data ini sangat penting dalam Regresi Linear karena dapat mempengaruhi nilai errornya.

forest_f <- forest %>%  
 select (X, Y, FFMC, DMC, DC, ISI, temp, RH, rain, wind, area, month, day) %>% 
 filter (area<=100) %>% 
 filter (month =="February" | month =="March" | month =="April" | month =="June" | month =="July" | month =="August" | month =="September" | month =="October" | month =="December")
head(forest_f,1)

## # A tibble: 1 x 13
##       X     Y  FFMC   DMC    DC   ISI  temp    RH  rain  wind  area month day   
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> 
## 1     7     5  86.2  26.2  94.3   5.1   8.2    51     0   6.7     0 March Friday

melakukan pengecekan untuk melihat variable area yang sudah difilter

hist(forest_f$area, breaks=50)

2.1 Plot Month vs Area

Membuat plot bulan vs area, untuk meihat pada bulan apa saja tingkat kebakaran hutan yang paling besar

forest_mth <- forest %>%
    group_by(month) %>% 
    summarise(Quantity=(n()/10), 
              Total_Area= (sum(area)/100),
              temperature = round(mean(temp),2), 
              RH = round(mean(RH),2), 
              rain = round(mean(rain),2), 
              wind = round(mean(wind),2)) %>% 
    arrange(month)

forest_m <- pivot_longer(forest_mth, cols = c("Quantity", "Total_Area"))
forest_m

## # A tibble: 24 x 7
##    month    temperature    RH  rain  wind name       value
##    <fct>          <dbl> <dbl> <dbl> <dbl> <chr>      <dbl>
##  1 January         5.25  89       0  2    Quantity   0.2  
##  2 January         5.25  89       0  2    Total_Area 0    
##  3 February        9.63  55.7     0  3.75 Quantity   2    
##  4 February        9.63  55.7     0  3.75 Total_Area 1.25 
##  5 March          13.1   40       0  4.97 Quantity   5.4  
##  6 March          13.1   40       0  4.97 Total_Area 2.35 
##  7 April          12.0   46.9     0  4.67 Quantity   0.9  
##  8 April          12.0   46.9     0  4.67 Total_Area 0.800
##  9 May            14.6   67       0  4.45 Quantity   0.2  
## 10 May            14.6   67       0  4.45 Total_Area 0.385
## # ... with 14 more rows

plot1 <- ggplot(forest_m, aes(x = month, y = value, text = paste("Value:", value))) + 
        geom_col(mapping = aes(fill = name), position = "dodge") + 
        labs(x = "Quantity (#)",
            y = "",
            fill = "Legends") +
        # facet_grid(name~., scales = "free_y") +
         scale_fill_discrete(aesthetics = "fill", labels = c("Quantity (#)","Area (ha)"),) 

ggplotly(plot1, tooltip = "text")

2.2 Plot Day vs Area

Membuat plot hari vs area, untuk meihat pada bulan apa saja tingkat kebakaran hutan yang paling besar, serta mengambil contoh pada bulan September, karena pada bulan september tingkat kebakaran hutan terjadi yang paling tinggi

forest_day <- forest %>%
    filter(month == 'September') %>%
    group_by(day) %>%
    summarise(freq=n(), sum_area=sum(area)) %>%
    ungroup()

forest_day

## # A tibble: 7 x 3
##   day        freq sum_area
##   <fct>     <int>    <dbl>
## 1 Sunday       27     378.
## 2 Monday       28     166.
## 3 Tuesday      19     501.
## 4 Wednesday    14     182.
## 5 Thursday     21     112.
## 6 Friday       38     201.
## 7 Saturday     25    1545.

# position dodge
forest_day <- forest %>%
    filter(month == 'September') %>%
    group_by(day) %>%
    summarise(Quantity=n(), sum_area=sum(area)) %>%
    ungroup()

forest_day <- pivot_longer(forest_day, cols = c("Quantity", "sum_area"))
forest_day

## # A tibble: 14 x 3
##    day       name     value
##    <fct>     <chr>    <dbl>
##  1 Sunday    Quantity   27 
##  2 Sunday    sum_area  378.
##  3 Monday    Quantity   28 
##  4 Monday    sum_area  166.
##  5 Tuesday   Quantity   19 
##  6 Tuesday   sum_area  501.
##  7 Wednesday Quantity   14 
##  8 Wednesday sum_area  182.
##  9 Thursday  Quantity   21 
## 10 Thursday  sum_area  112.
## 11 Friday    Quantity   38 
## 12 Friday    sum_area  201.
## 13 Saturday  Quantity   25 
## 14 Saturday  sum_area 1545.

plot2 <- ggplot(forest_day, aes(x = day, y = value, text = paste("value:", value))) + 
  geom_col(mapping = aes(fill = name), position = "dodge") + 
  labs(x = "",
       y = "",
       fill = "Legends") +
  facet_grid(scales = "free_y") +
  scale_fill_discrete(aesthetics = "fill", # state aes to costum
                      labels = c("Quantity (#)","Area (ha)"),)

ggplotly(plot2, tooltip = "text")

3 Machine Learning

Kita akan mencoba beberapa machine learning seperti Regression Linear, Decision Tree, dan Random Forest untuk memprediksi kebakaran hutan. Serta Metric Evaluation yang akan dipakai adalah RMSE atau Random Mean Square Error karena Metric tersebut cocok untuk data yang memiliki sedikit kategorikal dan target yang berupa numerikal.

3.1 Cross Validation

Membagi data jadi data train dan test

set.seed(100)
idx <-  sample(nrow(forest_f), nrow(forest_f)*0.8)
forest_train <- forest_f[idx,]  
forest_test <-  forest_f[-idx,]

3.2 Regression Linear

Membuat model tanpa prediktor

ff.none <- lm(area~1, forest_train)

Cek model dengan semua prediktor

ff.all <- lm(area~., forest_train)

Menggunakan Stepwise regression untuk mengecek nilai AIC

ff.back <- stats::step(ff.all, direction = "backward")

## Start:  AIC=2138.75
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind + 
##     month + day
## 
##         Df Sum of Sq   RSS    AIC
## - day    6    637.48 74750 2130.2
## - month  8   2533.10 76645 2136.2
## - rain   1     35.71 74148 2136.9
## - FFMC   1     36.15 74148 2136.9
## - Y      1     40.22 74153 2137.0
## - X      1     99.80 74212 2137.3
## - ISI    1    149.40 74262 2137.6
## - RH     1    332.84 74445 2138.5
## <none>               74112 2138.8
## - DMC    1    430.83 74543 2139.1
## - wind   1    508.51 74621 2139.5
## - DC     1    555.77 74668 2139.7
## - temp   1   1323.97 75436 2143.8
## 
## Step:  AIC=2130.17
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind + 
##     month
## 
##         Df Sum of Sq   RSS    AIC
## - month  8   2615.50 77365 2127.9
## - Y      1     32.27 74782 2128.3
## - rain   1     35.14 74785 2128.4
## - FFMC   1     35.40 74785 2128.4
## - X      1     69.94 74820 2128.6
## - ISI    1    127.57 74877 2128.9
## <none>               74750 2130.2
## - DMC    1    398.04 75148 2130.3
## - RH     1    403.48 75153 2130.3
## - wind   1    531.13 75281 2131.0
## - DC     1    656.23 75406 2131.7
## - temp   1   1504.07 76254 2136.1
## 
## Step:  AIC=2127.93
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind
## 
##        Df Sum of Sq   RSS    AIC
## - Y     1      0.01 77365 2125.9
## - DMC   1      2.14 77367 2125.9
## - DC    1      4.26 77370 2125.9
## - rain  1     19.94 77385 2126.0
## - FFMC  1     34.09 77399 2126.1
## - X     1     40.68 77406 2126.1
## - RH    1     48.68 77414 2126.2
## - ISI   1    232.12 77597 2127.1
## <none>              77365 2127.9
## - temp  1    632.44 77998 2129.2
## - wind  1    776.64 78142 2129.9
## 
## Step:  AIC=2125.93
## area ~ X + FFMC + DMC + DC + ISI + temp + RH + rain + wind
## 
##        Df Sum of Sq   RSS    AIC
## - DMC   1      2.12 77367 2123.9
## - DC    1      4.44 77370 2123.9
## - rain  1     19.94 77385 2124.0
## - FFMC  1     34.09 77399 2124.1
## - RH    1     48.70 77414 2124.2
## - X     1     53.66 77419 2124.2
## - ISI   1    232.41 77598 2125.1
## <none>              77365 2125.9
## - temp  1    632.76 77998 2127.2
## - wind  1    781.45 78147 2127.9
## 
## Step:  AIC=2123.94
## area ~ X + FFMC + DC + ISI + temp + RH + rain + wind
## 
##        Df Sum of Sq   RSS    AIC
## - DC    1     10.42 77378 2122.0
## - rain  1     20.35 77388 2122.1
## - FFMC  1     38.73 77406 2122.1
## - X     1     53.30 77421 2122.2
## - RH    1     67.09 77435 2122.3
## - ISI   1    232.74 77600 2123.1
## <none>              77367 2123.9
## - temp  1    741.24 78109 2125.8
## - wind  1    800.39 78168 2126.1
## 
## Step:  AIC=2122
## area ~ X + FFMC + ISI + temp + RH + rain + wind
## 
##        Df Sum of Sq   RSS    AIC
## - rain  1     22.12 77400 2120.1
## - X     1     49.15 77427 2120.2
## - FFMC  1     52.35 77430 2120.3
## - RH    1    101.25 77479 2120.5
## - ISI   1    246.50 77624 2121.3
## <none>              77378 2122.0
## - wind  1    789.98 78168 2124.1
## - temp  1   1021.75 78400 2125.2
## 
## Step:  AIC=2120.11
## area ~ X + FFMC + ISI + temp + RH + wind
## 
##        Df Sum of Sq   RSS    AIC
## - X     1     44.73 77445 2118.3
## - FFMC  1     49.07 77449 2118.4
## - RH    1     88.28 77488 2118.6
## - ISI   1    243.78 77644 2119.4
## <none>              77400 2120.1
## - wind  1    773.60 78174 2122.1
## - temp  1    999.63 78400 2123.2
## 
## Step:  AIC=2118.34
## area ~ FFMC + ISI + temp + RH + wind
## 
##        Df Sum of Sq   RSS    AIC
## - FFMC  1     45.84 77491 2116.6
## - RH    1     95.31 77540 2116.8
## - ISI   1    238.89 77684 2117.6
## <none>              77445 2118.3
## - wind  1    766.05 78211 2120.3
## - temp  1    983.03 78428 2121.4
## 
## Step:  AIC=2116.58
## area ~ ISI + temp + RH + wind
## 
##        Df Sum of Sq   RSS    AIC
## - RH    1     86.34 77577 2115.0
## - ISI   1    194.48 77685 2115.6
## <none>              77491 2116.6
## - wind  1    747.18 78238 2118.4
## - temp  1   1109.10 78600 2120.3
## 
## Step:  AIC=2115.02
## area ~ ISI + temp + wind
## 
##        Df Sum of Sq   RSS    AIC
## - ISI   1    171.80 77749 2113.9
## <none>              77577 2115.0
## - wind  1    716.97 78294 2116.7
## - temp  1   1110.55 78687 2118.7
## 
## Step:  AIC=2113.91
## area ~ temp + wind
## 
##        Df Sum of Sq   RSS    AIC
## <none>              77749 2113.9
## - wind  1    604.45 78353 2115.0
## - temp  1    938.92 78688 2116.7

Cek nilai r-squared untuk ff.back

summary(ff.back)

## 
## Call:
## lm(formula = area ~ temp + wind, data = forest_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.699  -6.699  -4.834  -0.023  87.009 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.7019     3.2920  -0.517   0.6055  
## temp          0.2753     0.1257   2.190   0.0291 *
## wind          0.7160     0.4075   1.757   0.0797 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.99 on 397 degrees of freedom
## Multiple R-squared:  0.01588,    Adjusted R-squared:  0.01092 
## F-statistic: 3.202 on 2 and 397 DF,  p-value: 0.04172

Predict model dengan hasil test, dengan parameter confidence interval

model.ff.back.con <- predict(ff.back, forest_test, interval = "confidence", level=0.95)

Predict model dengan hasil test, dengan parameter prediction interval

model.ff.back.pre <- predict(ff.back, forest_test, interval = "prediction", level=0.95)

Mendapatkan nilai RMSE

RMSE(model.ff.back.con, forest_test$area)

## [1] 13.01183

RMSE(model.ff.back.pre, forest_test$area)

## [1] 25.96001

jJika menggunakan machine learning Regresssion Linear didapatkan nilai Error sebesar 13.0 yang berarti nilai error tersebut mendapatkan lebih kurang 13.0 hektar

3.3 Random Forest

Cross validation untuk machine learning random forest

set.seed(100)
# sample split
splitted <- initial_split(forest_f, prop = 0.8)

# define preprocess recipe from train dataset
rec <- recipe(area ~ ., data = training(splitted)[,-c(1:2)]) %>%
  step_sqrt(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()

# prepare recipes-revert functions
rec_rev <- function(x, rec) {
  means <- rec$steps[[2]]$means[["area"]]
  sds <- rec$steps[[3]]$sds[["area"]]
  x <- (x * sds + means) ^ 2
  x
}

# preprocess train and test dataset
data_train <- juice(rec)
data_test <- bake(rec, testing(splitted))

# quick check
head(data_train, 10)

## # A tibble: 10 x 11
##        FFMC    DMC      DC    ISI    temp     RH   rain   wind month day  
##       <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <fct> <fct>
##  1 -0.879   -1.46  -1.76   -0.882 -1.93    0.526 -0.105  1.37  March Frid~
##  2 -0.00580 -1.21   0.520  -0.438 -0.0491 -0.658 -0.105 -2.14  Octo~ Tues~
##  3 -0.00580 -1.00   0.569  -0.438 -0.623  -0.658 -0.105 -1.73  Octo~ Satu~
##  4 -0.261   -0.834 -1.71    0.246 -1.23    2.91  -0.105 -1.30  March Sund~
##  5  0.326   -0.196 -0.0138  1.23   0.589  -0.963 -0.105  0.802 Augu~ Sund~
##  6  0.170    0.672  0.350   0.478 -1.97    2.33  -0.105 -0.995 Augu~ Mond~
##  7  0.0725   0.463  0.584  -0.361 -0.897   1.20  -0.105  0.802 Sept~ Tues~
##  8  0.365   -0.152  0.600  -0.336  0.675  -0.166 -0.105  0.110 Sept~ Satu~
##  9  0.365   -0.152  0.600  -0.336 -0.0812  0.526 -0.105  1.57  Sept~ Satu~
## 10 -5.79    -0.449  0.510  -2.71  -0.212   1.67  -0.105  1.37  Augu~ Frid~
## # ... with 1 more variable: area <dbl>

# define model specification
model_spec <- rand_forest(
  mode = "regression",
  mtry = ncol(data_train) - 1,
  trees = 500,
  min_n = 1
)

# define model engine
model_engine <- set_engine(
  object = model_spec,
  engine = "ranger",
  seed = 100,
  num.threads = parallel::detectCores() / 2,
  importance = "impurity"
)

# quick check
model_engine

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = ncol(data_train) - 1
##   trees = 500
##   min_n = 1
## 
## Engine-Specific Arguments:
##   seed = 100
##   num.threads = parallel::detectCores()/2
##   importance = impurity
## 
## Computational engine: ranger

# fit the model
model <- fit_xy(
  object = model_engine,
  x = dplyr::select(data_train, -area),
  y = dplyr::select(data_train, area)
)

# quick check
model

## parsnip model object
## 
## Fit time:  130ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) -      1, num.trees = ~500, min.node.size = ~1, seed = ~100, num.threads = ~parallel::detectCores()/2,      importance = ~"impurity", verbose = FALSE) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      401 
## Number of independent variables:  10 
## Mtry:                             10 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       1.125203 
## R squared (OOB):                  -0.125203

coord <- testing(splitted)[,1:2]
# predict on test
pred_test <- dplyr::select(data_test, area) %>%
  bind_cols(predict(model, data_test)) %>%
  mutate_all(~ rec_rev(., rec = rec)) %>% 
  cbind(coord) %>%
  mutate(rmse = rmse_vec(truth = area, estimate = .pred)) %>%
  rename("Area_Prediction" = .pred) %>% 
  mutate(Category = case_when(Area_Prediction <= 20 ~ "Low",
                               Area_Prediction >= 20 & Area_Prediction <=40 ~ "Medium",
                               Area_Prediction >= 40 ~ "High")) %>% 
  mutate(Category = as.factor(Category))

pred_test %>% 
  filter(Area_Prediction >= 5 & Area_Prediction <= 5.8)

##   area Area_Prediction X Y     rmse Category
## 1 0.68        5.098211 4 3 11.60533      Low
## 2 0.00        5.581307 7 5 11.60533      Low
## 3 4.95        5.750346 3 4 11.60533      Low

mengecek nilai error untuk machine learning random forest

# calculate error and rsquared
rmse_rf <- pred_test %>%
          summarise(rmse = rmse_vec(truth = area, estimate = Area_Prediction))
rmse_rf$rmse

## [1] 11.60533

Jika menggunakan random Forest didapatkan nilai Error sebesar 11.6 yang berarti nilai error tersebut mendapatkan lebih kurang 11.6 hektar

saveRDS(model, file = "model.rds")

3.4 Decision Tree

Membuat model menggunakan Decision Tree Regression

fit <- rpart(area~., method = "anova", forest_train)

plot(fit, uniform=TRUE, 
 main="Regression Tree for Forest Fire")
 text(fit, use.n=TRUE, cex = .6)

printcp(fit)

## 
## Regression tree:
## rpart(formula = area ~ ., data = forest_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] day  ISI  temp
## 
## Root node error: 79003/400 = 197.51
## 
## n= 400 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.053846      0   1.00000 1.0079 0.20390
## 2 0.045397      2   0.89231 1.1434 0.21105
## 3 0.010000      3   0.84691 1.1362 0.19518

par(mfrow=c(1,2)) 
rsq.rpart(fit)

## 
## Regression tree:
## rpart(formula = area ~ ., data = forest_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] day  ISI  temp
## 
## Root node error: 79003/400 = 197.51
## 
## n= 400 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.053846      0   1.00000 1.0079 0.20390
## 2 0.045397      2   0.89231 1.1434 0.21105
## 3 0.010000      3   0.84691 1.1362 0.19518

# prune the tree
pfit<- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

Menguji model Decision Tree yang sudah dibuat dengan cara diuji prediksi pada data test

model.fit <- predict(fit, forest_test, method="anova")

Mengecek nilai rmse dari model.fit

rf_RMSE <- RMSE(model.fit, forest_test$area)
rf_RMSE

## [1] 13.5371

Jika menggunakan mahine learning Decision Tree didapatkan nilai Error sebesar 13.5 yang berarti nilai error tersebut mendapatkan lebih kurang 13.5 hektar

4 Mapping location and category predict area burned

Berikut ini kita akan menampilkan area prediksi yang berpeluang terkena kebakaran hutan, titik tersebut dapat ditempilkan dengan peta taman nasional montensinho

Memuat peta dasar

map <- png::readPNG("WWW/The-map-of-the-Montesinho-natural.png")

Memuat titik prediksi kebakaran hutan

plot3 <- ggplot(pred_test, aes(X, Y)) + geom_point() + scale_y_reverse()
plot3

Menggabungkan titik perdiksi dengan peta dasar

plot4 <- plot3 + 
    background_image(map) + 
    geom_point(aes(colour = Category), size=pred_test$Area_Prediction, alpha= 0.6) +
    scale_x_continuous(limits = c(0,9), breaks = seq(0,9,1)) +
    scale_y_continuous(limits = c(0,9), breaks = seq(0,9,1)) +
    scale_y_reverse() +
    labs(title="Forest Fire Prediction Map", caption="source: Montensinho National Park")

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

plot4

5 Conclusion

Berdasarkan Metric yang sudah ditentukan maka dapat dipilih model prediksi menggunakan Random Forest, karena memberikan hasil yang lebih baik dengan nilai RMSE sebesar 11.6. Maka dari itu model yang terbaik untuk prediksi kebakaran hutan berdasarkan area sebaiknya menggunakan model dari Random Forest.

Predict the Best Model for Forest Fire

Alif Aziz

12/20/2019