Kebakaran hutan merupakan salah satu isu penting di dunia. hampir disetiap daerah di dunia terdapat titik kebakaran hutan, mulai dari Amerika, Australia, hingga Eropa. isu kebakaran hutan sangat mempengaruhi setiap aspek kehidupan makhluk hidup. Banyak yang menjadi penyebab kebakaran hutan seperti kelalaian manusia, kenaikan suhu dunia. meskipun sudah banyak negara yang mengeluarkan dana yang tidak sedikit untuk menanggulangi bencana ini, tetapi tetap saja terjadi bencana ini. Sebagai contoh di Portugal pada tahun 1980 sampai 2005 lebih dari 2.7 juta ha sudah habis terbakar. Terdapat alat bantu untuk memberikan rating bahaya pada kebakaran hutan dengan menggunakan Canadian System yaitu Fire Weather Index (FWI).
Pada project kali ini, saya mencoba untuk membuat model terbaik untuk memprediksikan kebakaran hutan. saya menggunakan data kebakaran hutan didapatkan dari Cortez dan Morais (2007). data yang diambil berkisar dari bulan January 2000 hingga Desember 2003. Lokasi tempat pengambilan data berasal dari Montensinho National Park. Dapat dilihat data tersebut berisi variabel:
Memuat library
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
## -- Attaching packages --------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v purrr 0.3.3 v forcats 0.4.0
## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## -- Attaching packages -------------------------- tidymodels 0.0.3 --
## v broom 0.5.3 v recipes 0.1.9
## v dials 0.0.4 v rsample 0.0.5
## v infer 0.5.1 v yardstick 0.0.4
## v parsnip 0.0.5
## -- Conflicts ----------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x dials::margin() masks ggplot2::margin()
## x dplyr::select() masks MASS::select()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:parsnip':
##
## translate
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
##
## Attaching package: 'funModeling'
## The following object is masked from 'package:GGally':
##
## range01
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dials':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
Memuat dataset
## Parsed with column specification:
## cols(
## X = col_double(),
## Y = col_double(),
## month = col_character(),
## day = col_character(),
## FFMC = col_double(),
## DMC = col_double(),
## DC = col_double(),
## ISI = col_double(),
## temp = col_double(),
## RH = col_double(),
## wind = col_double(),
## rain = col_double(),
## area = col_double()
## )
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of 13 variables:
## $ X : num 7 7 7 8 8 8 8 8 8 7 ...
## $ Y : num 5 4 4 6 6 6 6 6 6 5 ...
## $ month: chr "mar" "oct" "oct" "mar" ...
## $ day : chr "fri" "tue" "sat" "fri" ...
## $ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
## $ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
## $ DC : num 94.3 669.1 686.9 77.5 102.2 ...
## $ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
## $ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
## $ RH : num 51 33 33 97 99 29 27 86 63 40 ...
## $ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
## $ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. X = col_double(),
## .. Y = col_double(),
## .. month = col_character(),
## .. day = col_character(),
## .. FFMC = col_double(),
## .. DMC = col_double(),
## .. DC = col_double(),
## .. ISI = col_double(),
## .. temp = col_double(),
## .. RH = col_double(),
## .. wind = col_double(),
## .. rain = col_double(),
## .. area = col_double()
## .. )
pada data tersebut terdapat 517 observasi dan 13 variabel. penjelasan untuk masing-masing variabel adalah :
X - x-axis spatial coordinate within the Montesinho park map: 1 to 9Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9month - month of the year: ‘jan’ to ‘dec’day - day of the week: ‘mon’ to ‘sun’FFMC - FFMC index dari FWI system: 18.7 to 96.20DMC - DMC index dari FWI system: 1.1 to 291.3DC - DC index dari FWI system: 7.9 to 860.6ISI - ISI index dari FWI system: 0.0 to 56.10temp - suhu in Celsius degrees: 2.2 to 33.30RH - kelembaban relatif in %: 15.0 to 100wind - kecepatan angin in km/h: 0.40 to 9.40rain - curah hujan in mm/m2 : 0.0 to 6.4area - area yang terbakar (in ha): 0.00 to 1090.84Sebelum memulai Exploratory data kita dapat mengecek nilai NA pada setiap variable
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## 0 0 0 0 0 0 0 0 0 0 0 0 0
Kesimpulan dari pengecekan ini tidak terdapat missing value pada data
Untuk melihat gambaran dari datanya kita dapat menggunakan fungsi head pada data
## # A tibble: 1 x 13
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0 0
Memperbaiki nama bulan menjadi kapital dan sesuai ejaan yang disempurnakan serta mengatur level bulan agar sesuai dengan yang diinginkan
forest$month <- sapply(as.character(forest$month), switch,
"jan" = "January",
"feb" = "February",
"mar" = "March",
"apr" = "April",
"may" = "May",
"jun" = "June",
"jul" = "July",
"aug" = "August",
"sep" = "September",
"oct" = "October",
"nov" = "November",
"dec" = "December")
forest$month <- as.factor(forest$month)
str(forest)## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of 13 variables:
## $ X : num 7 7 7 8 8 8 8 8 8 7 ...
## $ Y : num 5 4 4 6 6 6 6 6 6 5 ...
## $ month: Factor w/ 12 levels "April","August",..: 8 11 11 8 8 2 2 2 12 12 ...
## $ day : chr "fri" "tue" "sat" "fri" ...
## $ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
## $ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
## $ DC : num 94.3 669.1 686.9 77.5 102.2 ...
## $ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
## $ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
## $ RH : num 51 33 33 97 99 29 27 86 63 40 ...
## $ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
## $ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. X = col_double(),
## .. Y = col_double(),
## .. month = col_character(),
## .. day = col_character(),
## .. FFMC = col_double(),
## .. DMC = col_double(),
## .. DC = col_double(),
## .. ISI = col_double(),
## .. temp = col_double(),
## .. RH = col_double(),
## .. wind = col_double(),
## .. rain = col_double(),
## .. area = col_double()
## .. )
forest$month <- factor(forest$month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
unique(forest$month)## [1] March October August September April June July
## [8] February January December May November
## 12 Levels: January February March April May June July August ... December
Memperbaiki nama hari menjadi kapital dan sesuai ejaan yang disempurnakan serta mengatur level bulan agar sesuai dengan yang diinginkan
forest$day <- sapply(as.character(forest$day), switch,
"sun" = "Sunday",
"mon" = "Monday",
"tue" = "Tuesday",
"wed" = "Wednesday",
"thu" = "Thursday",
"fri" = "Friday",
"sat" = "Saturday")
forest$day <- as.factor(forest$day)
str(forest)## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 517 obs. of 13 variables:
## $ X : num 7 7 7 8 8 8 8 8 8 7 ...
## $ Y : num 5 4 4 6 6 6 6 6 6 5 ...
## $ month: Factor w/ 12 levels "January","February",..: 3 10 10 3 3 8 8 8 9 9 ...
## $ day : Factor w/ 7 levels "Friday","Monday",..: 1 6 3 1 4 4 2 2 6 3 ...
## $ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
## $ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
## $ DC : num 94.3 669.1 686.9 77.5 102.2 ...
## $ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
## $ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
## $ RH : num 51 33 33 97 99 29 27 86 63 40 ...
## $ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
## $ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. X = col_double(),
## .. Y = col_double(),
## .. month = col_character(),
## .. day = col_character(),
## .. FFMC = col_double(),
## .. DMC = col_double(),
## .. DC = col_double(),
## .. ISI = col_double(),
## .. temp = col_double(),
## .. RH = col_double(),
## .. wind = col_double(),
## .. rain = col_double(),
## .. area = col_double()
## .. )
forest$day <- factor(forest$day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
unique(forest$day)## [1] Friday Tuesday Saturday Sunday Monday Wednesday Thursday
## Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Mengecek hari dan bulan sudah terkoreksi dengan baik
## # A tibble: 3 x 13
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 March Frid~ 86.2 26.2 94.3 5.1 8.2 51 6.7 0 0
## 2 7 4 Octob~ Tues~ 90.6 35.4 669. 6.7 18 33 0.9 0 0
## 3 7 4 Octob~ Satu~ 90.6 43.7 687. 6.7 14.6 33 1.3 0 0
Mengecek variabel yang bisa ditoleransi untuk menjadi prediktor
## Warning in ggcorr(forest, label = T, hjust = 1): data in column(s) 'month',
## 'day' are not numeric and were ignored
Summary: Terdapat hubungan yang kuat antar prediktor DMC dan DC dengan nilai dari ggcorr sebesar 0.7 (mendekati 1) dan RH dengan temp sebesar -0.5 (mendekati -1)
perlu mengecek outlier yang ada pada variable area menggunakan histogram
Mengecek outlier setiap variable
mengecek variable bulan yang terdapat didata
forest_m <- forest %>%
group_by(month) %>%
summarise(Quantity=(n()),
Total_Area= (sum(area))) %>%
arrange(month)
forest_m## # A tibble: 12 x 3
## month Quantity Total_Area
## <fct> <int> <dbl>
## 1 January 2 0
## 2 February 20 126.
## 3 March 54 235.
## 4 April 9 80.0
## 5 May 2 38.5
## 6 June 17 99.3
## 7 July 32 460.
## 8 August 184 2298.
## 9 September 172 3086.
## 10 October 15 99.6
## 11 November 1 0
## 12 December 9 120.
Menghilangkan data yang tidak penting seperti bulan yang terjadi kebakarannya kecil (January, May, dan November) serta area yang terbakar > 100 Ha. Penghapusan data ini sangat penting dalam Regresi Linear karena dapat mempengaruhi nilai errornya.
forest_f <- forest %>%
select (X, Y, FFMC, DMC, DC, ISI, temp, RH, rain, wind, area, month, day) %>%
filter (area<=100) %>%
filter (month =="February" | month =="March" | month =="April" | month =="June" | month =="July" | month =="August" | month =="September" | month =="October" | month =="December")
head(forest_f,1)## # A tibble: 1 x 13
## X Y FFMC DMC DC ISI temp RH rain wind area month day
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 5 86.2 26.2 94.3 5.1 8.2 51 0 6.7 0 March Friday
melakukan pengecekan untuk melihat variable area yang sudah difilter
Membuat plot bulan vs area, untuk meihat pada bulan apa saja tingkat kebakaran hutan yang paling besar
forest_mth <- forest %>%
group_by(month) %>%
summarise(Quantity=(n()/10),
Total_Area= (sum(area)/100),
temperature = round(mean(temp),2),
RH = round(mean(RH),2),
rain = round(mean(rain),2),
wind = round(mean(wind),2)) %>%
arrange(month)
forest_m <- pivot_longer(forest_mth, cols = c("Quantity", "Total_Area"))
forest_m## # A tibble: 24 x 7
## month temperature RH rain wind name value
## <fct> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 January 5.25 89 0 2 Quantity 0.2
## 2 January 5.25 89 0 2 Total_Area 0
## 3 February 9.63 55.7 0 3.75 Quantity 2
## 4 February 9.63 55.7 0 3.75 Total_Area 1.25
## 5 March 13.1 40 0 4.97 Quantity 5.4
## 6 March 13.1 40 0 4.97 Total_Area 2.35
## 7 April 12.0 46.9 0 4.67 Quantity 0.9
## 8 April 12.0 46.9 0 4.67 Total_Area 0.800
## 9 May 14.6 67 0 4.45 Quantity 0.2
## 10 May 14.6 67 0 4.45 Total_Area 0.385
## # ... with 14 more rows
plot1 <- ggplot(forest_m, aes(x = month, y = value, text = paste("Value:", value))) +
geom_col(mapping = aes(fill = name), position = "dodge") +
labs(x = "Quantity (#)",
y = "",
fill = "Legends") +
# facet_grid(name~., scales = "free_y") +
scale_fill_discrete(aesthetics = "fill", labels = c("Quantity (#)","Area (ha)"),)
ggplotly(plot1, tooltip = "text")Membuat plot hari vs area, untuk meihat pada bulan apa saja tingkat kebakaran hutan yang paling besar, serta mengambil contoh pada bulan September, karena pada bulan september tingkat kebakaran hutan terjadi yang paling tinggi
forest_day <- forest %>%
filter(month == 'September') %>%
group_by(day) %>%
summarise(freq=n(), sum_area=sum(area)) %>%
ungroup()
forest_day## # A tibble: 7 x 3
## day freq sum_area
## <fct> <int> <dbl>
## 1 Sunday 27 378.
## 2 Monday 28 166.
## 3 Tuesday 19 501.
## 4 Wednesday 14 182.
## 5 Thursday 21 112.
## 6 Friday 38 201.
## 7 Saturday 25 1545.
# position dodge
forest_day <- forest %>%
filter(month == 'September') %>%
group_by(day) %>%
summarise(Quantity=n(), sum_area=sum(area)) %>%
ungroup()
forest_day <- pivot_longer(forest_day, cols = c("Quantity", "sum_area"))
forest_day## # A tibble: 14 x 3
## day name value
## <fct> <chr> <dbl>
## 1 Sunday Quantity 27
## 2 Sunday sum_area 378.
## 3 Monday Quantity 28
## 4 Monday sum_area 166.
## 5 Tuesday Quantity 19
## 6 Tuesday sum_area 501.
## 7 Wednesday Quantity 14
## 8 Wednesday sum_area 182.
## 9 Thursday Quantity 21
## 10 Thursday sum_area 112.
## 11 Friday Quantity 38
## 12 Friday sum_area 201.
## 13 Saturday Quantity 25
## 14 Saturday sum_area 1545.
plot2 <- ggplot(forest_day, aes(x = day, y = value, text = paste("value:", value))) +
geom_col(mapping = aes(fill = name), position = "dodge") +
labs(x = "",
y = "",
fill = "Legends") +
facet_grid(scales = "free_y") +
scale_fill_discrete(aesthetics = "fill", # state aes to costum
labels = c("Quantity (#)","Area (ha)"),)
ggplotly(plot2, tooltip = "text")Kita akan mencoba beberapa machine learning seperti Regression Linear, Decision Tree, dan Random Forest untuk memprediksi kebakaran hutan. Serta Metric Evaluation yang akan dipakai adalah RMSE atau Random Mean Square Error karena Metric tersebut cocok untuk data yang memiliki sedikit kategorikal dan target yang berupa numerikal.
Membagi data jadi data train dan test
Membuat model tanpa prediktor
Cek model dengan semua prediktor
Menggunakan Stepwise regression untuk mengecek nilai AIC
## Start: AIC=2138.75
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind +
## month + day
##
## Df Sum of Sq RSS AIC
## - day 6 637.48 74750 2130.2
## - month 8 2533.10 76645 2136.2
## - rain 1 35.71 74148 2136.9
## - FFMC 1 36.15 74148 2136.9
## - Y 1 40.22 74153 2137.0
## - X 1 99.80 74212 2137.3
## - ISI 1 149.40 74262 2137.6
## - RH 1 332.84 74445 2138.5
## <none> 74112 2138.8
## - DMC 1 430.83 74543 2139.1
## - wind 1 508.51 74621 2139.5
## - DC 1 555.77 74668 2139.7
## - temp 1 1323.97 75436 2143.8
##
## Step: AIC=2130.17
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind +
## month
##
## Df Sum of Sq RSS AIC
## - month 8 2615.50 77365 2127.9
## - Y 1 32.27 74782 2128.3
## - rain 1 35.14 74785 2128.4
## - FFMC 1 35.40 74785 2128.4
## - X 1 69.94 74820 2128.6
## - ISI 1 127.57 74877 2128.9
## <none> 74750 2130.2
## - DMC 1 398.04 75148 2130.3
## - RH 1 403.48 75153 2130.3
## - wind 1 531.13 75281 2131.0
## - DC 1 656.23 75406 2131.7
## - temp 1 1504.07 76254 2136.1
##
## Step: AIC=2127.93
## area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + rain + wind
##
## Df Sum of Sq RSS AIC
## - Y 1 0.01 77365 2125.9
## - DMC 1 2.14 77367 2125.9
## - DC 1 4.26 77370 2125.9
## - rain 1 19.94 77385 2126.0
## - FFMC 1 34.09 77399 2126.1
## - X 1 40.68 77406 2126.1
## - RH 1 48.68 77414 2126.2
## - ISI 1 232.12 77597 2127.1
## <none> 77365 2127.9
## - temp 1 632.44 77998 2129.2
## - wind 1 776.64 78142 2129.9
##
## Step: AIC=2125.93
## area ~ X + FFMC + DMC + DC + ISI + temp + RH + rain + wind
##
## Df Sum of Sq RSS AIC
## - DMC 1 2.12 77367 2123.9
## - DC 1 4.44 77370 2123.9
## - rain 1 19.94 77385 2124.0
## - FFMC 1 34.09 77399 2124.1
## - RH 1 48.70 77414 2124.2
## - X 1 53.66 77419 2124.2
## - ISI 1 232.41 77598 2125.1
## <none> 77365 2125.9
## - temp 1 632.76 77998 2127.2
## - wind 1 781.45 78147 2127.9
##
## Step: AIC=2123.94
## area ~ X + FFMC + DC + ISI + temp + RH + rain + wind
##
## Df Sum of Sq RSS AIC
## - DC 1 10.42 77378 2122.0
## - rain 1 20.35 77388 2122.1
## - FFMC 1 38.73 77406 2122.1
## - X 1 53.30 77421 2122.2
## - RH 1 67.09 77435 2122.3
## - ISI 1 232.74 77600 2123.1
## <none> 77367 2123.9
## - temp 1 741.24 78109 2125.8
## - wind 1 800.39 78168 2126.1
##
## Step: AIC=2122
## area ~ X + FFMC + ISI + temp + RH + rain + wind
##
## Df Sum of Sq RSS AIC
## - rain 1 22.12 77400 2120.1
## - X 1 49.15 77427 2120.2
## - FFMC 1 52.35 77430 2120.3
## - RH 1 101.25 77479 2120.5
## - ISI 1 246.50 77624 2121.3
## <none> 77378 2122.0
## - wind 1 789.98 78168 2124.1
## - temp 1 1021.75 78400 2125.2
##
## Step: AIC=2120.11
## area ~ X + FFMC + ISI + temp + RH + wind
##
## Df Sum of Sq RSS AIC
## - X 1 44.73 77445 2118.3
## - FFMC 1 49.07 77449 2118.4
## - RH 1 88.28 77488 2118.6
## - ISI 1 243.78 77644 2119.4
## <none> 77400 2120.1
## - wind 1 773.60 78174 2122.1
## - temp 1 999.63 78400 2123.2
##
## Step: AIC=2118.34
## area ~ FFMC + ISI + temp + RH + wind
##
## Df Sum of Sq RSS AIC
## - FFMC 1 45.84 77491 2116.6
## - RH 1 95.31 77540 2116.8
## - ISI 1 238.89 77684 2117.6
## <none> 77445 2118.3
## - wind 1 766.05 78211 2120.3
## - temp 1 983.03 78428 2121.4
##
## Step: AIC=2116.58
## area ~ ISI + temp + RH + wind
##
## Df Sum of Sq RSS AIC
## - RH 1 86.34 77577 2115.0
## - ISI 1 194.48 77685 2115.6
## <none> 77491 2116.6
## - wind 1 747.18 78238 2118.4
## - temp 1 1109.10 78600 2120.3
##
## Step: AIC=2115.02
## area ~ ISI + temp + wind
##
## Df Sum of Sq RSS AIC
## - ISI 1 171.80 77749 2113.9
## <none> 77577 2115.0
## - wind 1 716.97 78294 2116.7
## - temp 1 1110.55 78687 2118.7
##
## Step: AIC=2113.91
## area ~ temp + wind
##
## Df Sum of Sq RSS AIC
## <none> 77749 2113.9
## - wind 1 604.45 78353 2115.0
## - temp 1 938.92 78688 2116.7
Cek nilai r-squared untuk ff.back
##
## Call:
## lm(formula = area ~ temp + wind, data = forest_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.699 -6.699 -4.834 -0.023 87.009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7019 3.2920 -0.517 0.6055
## temp 0.2753 0.1257 2.190 0.0291 *
## wind 0.7160 0.4075 1.757 0.0797 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.99 on 397 degrees of freedom
## Multiple R-squared: 0.01588, Adjusted R-squared: 0.01092
## F-statistic: 3.202 on 2 and 397 DF, p-value: 0.04172
Predict model dengan hasil test, dengan parameter confidence interval
Predict model dengan hasil test, dengan parameter prediction interval
Mendapatkan nilai RMSE
## [1] 13.01183
## [1] 25.96001
jJika menggunakan machine learning Regresssion Linear didapatkan nilai Error sebesar 13.0 yang berarti nilai error tersebut mendapatkan lebih kurang 13.0 hektar
Cross validation untuk machine learning random forest
# define preprocess recipe from train dataset
rec <- recipe(area ~ ., data = training(splitted)[,-c(1:2)]) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()# prepare recipes-revert functions
rec_rev <- function(x, rec) {
means <- rec$steps[[2]]$means[["area"]]
sds <- rec$steps[[3]]$sds[["area"]]
x <- (x * sds + means) ^ 2
x
}# preprocess train and test dataset
data_train <- juice(rec)
data_test <- bake(rec, testing(splitted))
# quick check
head(data_train, 10)## # A tibble: 10 x 11
## FFMC DMC DC ISI temp RH rain wind month day
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 -0.879 -1.46 -1.76 -0.882 -1.93 0.526 -0.105 1.37 March Frid~
## 2 -0.00580 -1.21 0.520 -0.438 -0.0491 -0.658 -0.105 -2.14 Octo~ Tues~
## 3 -0.00580 -1.00 0.569 -0.438 -0.623 -0.658 -0.105 -1.73 Octo~ Satu~
## 4 -0.261 -0.834 -1.71 0.246 -1.23 2.91 -0.105 -1.30 March Sund~
## 5 0.326 -0.196 -0.0138 1.23 0.589 -0.963 -0.105 0.802 Augu~ Sund~
## 6 0.170 0.672 0.350 0.478 -1.97 2.33 -0.105 -0.995 Augu~ Mond~
## 7 0.0725 0.463 0.584 -0.361 -0.897 1.20 -0.105 0.802 Sept~ Tues~
## 8 0.365 -0.152 0.600 -0.336 0.675 -0.166 -0.105 0.110 Sept~ Satu~
## 9 0.365 -0.152 0.600 -0.336 -0.0812 0.526 -0.105 1.57 Sept~ Satu~
## 10 -5.79 -0.449 0.510 -2.71 -0.212 1.67 -0.105 1.37 Augu~ Frid~
## # ... with 1 more variable: area <dbl>
# define model specification
model_spec <- rand_forest(
mode = "regression",
mtry = ncol(data_train) - 1,
trees = 500,
min_n = 1
)# define model engine
model_engine <- set_engine(
object = model_spec,
engine = "ranger",
seed = 100,
num.threads = parallel::detectCores() / 2,
importance = "impurity"
)
# quick check
model_engine## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = ncol(data_train) - 1
## trees = 500
## min_n = 1
##
## Engine-Specific Arguments:
## seed = 100
## num.threads = parallel::detectCores()/2
## importance = impurity
##
## Computational engine: ranger
# fit the model
model <- fit_xy(
object = model_engine,
x = dplyr::select(data_train, -area),
y = dplyr::select(data_train, area)
)
# quick check
model## parsnip model object
##
## Fit time: 130ms
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) - 1, num.trees = ~500, min.node.size = ~1, seed = ~100, num.threads = ~parallel::detectCores()/2, importance = ~"impurity", verbose = FALSE)
##
## Type: Regression
## Number of trees: 500
## Sample size: 401
## Number of independent variables: 10
## Mtry: 10
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 1.125203
## R squared (OOB): -0.125203
coord <- testing(splitted)[,1:2]
# predict on test
pred_test <- dplyr::select(data_test, area) %>%
bind_cols(predict(model, data_test)) %>%
mutate_all(~ rec_rev(., rec = rec)) %>%
cbind(coord) %>%
mutate(rmse = rmse_vec(truth = area, estimate = .pred)) %>%
rename("Area_Prediction" = .pred) %>%
mutate(Category = case_when(Area_Prediction <= 20 ~ "Low",
Area_Prediction >= 20 & Area_Prediction <=40 ~ "Medium",
Area_Prediction >= 40 ~ "High")) %>%
mutate(Category = as.factor(Category))## area Area_Prediction X Y rmse Category
## 1 0.68 5.098211 4 3 11.60533 Low
## 2 0.00 5.581307 7 5 11.60533 Low
## 3 4.95 5.750346 3 4 11.60533 Low
mengecek nilai error untuk machine learning random forest
# calculate error and rsquared
rmse_rf <- pred_test %>%
summarise(rmse = rmse_vec(truth = area, estimate = Area_Prediction))
rmse_rf$rmse## [1] 11.60533
Jika menggunakan random Forest didapatkan nilai Error sebesar 11.6 yang berarti nilai error tersebut mendapatkan lebih kurang 11.6 hektar
Membuat model menggunakan Decision Tree Regression
##
## Regression tree:
## rpart(formula = area ~ ., data = forest_train, method = "anova")
##
## Variables actually used in tree construction:
## [1] day ISI temp
##
## Root node error: 79003/400 = 197.51
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.053846 0 1.00000 1.0079 0.20390
## 2 0.045397 2 0.89231 1.1434 0.21105
## 3 0.010000 3 0.84691 1.1362 0.19518
##
## Regression tree:
## rpart(formula = area ~ ., data = forest_train, method = "anova")
##
## Variables actually used in tree construction:
## [1] day ISI temp
##
## Root node error: 79003/400 = 197.51
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.053846 0 1.00000 1.0079 0.20390
## 2 0.045397 2 0.89231 1.1434 0.21105
## 3 0.010000 3 0.84691 1.1362 0.19518
Menguji model Decision Tree yang sudah dibuat dengan cara diuji prediksi pada data test
Mengecek nilai rmse dari model.fit
## [1] 13.5371
Jika menggunakan mahine learning Decision Tree didapatkan nilai Error sebesar 13.5 yang berarti nilai error tersebut mendapatkan lebih kurang 13.5 hektar
Berikut ini kita akan menampilkan area prediksi yang berpeluang terkena kebakaran hutan, titik tersebut dapat ditempilkan dengan peta taman nasional montensinho
Memuat peta dasar
Memuat titik prediksi kebakaran hutan
Menggabungkan titik perdiksi dengan peta dasar
plot4 <- plot3 +
background_image(map) +
geom_point(aes(colour = Category), size=pred_test$Area_Prediction, alpha= 0.6) +
scale_x_continuous(limits = c(0,9), breaks = seq(0,9,1)) +
scale_y_continuous(limits = c(0,9), breaks = seq(0,9,1)) +
scale_y_reverse() +
labs(title="Forest Fire Prediction Map", caption="source: Montensinho National Park") ## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
Berdasarkan Metric yang sudah ditentukan maka dapat dipilih model prediksi menggunakan Random Forest, karena memberikan hasil yang lebih baik dengan nilai RMSE sebesar 11.6. Maka dari itu model yang terbaik untuk prediksi kebakaran hutan berdasarkan area sebaiknya menggunakan model dari Random Forest.