First, I am going to gather the data from Wikipedia with the rvest package.
library(rvest)
html <- read_html(url)  # url holds the address of the Wikipedia page
Let’s take a look at the html and see what we have gathered.
html
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
Looks like we got something. Let’s see if we can extract the tables from the webpage.
html %>%
  html_table()
## [[1]]
## # A tibble: 13 × 4
## X1 X2 X3 X4
## <chr> <chr> <chr> <chr>
## 1 ".mw-parser-output .loc… ".mw-parser-output .loc… "4.0−5.9 mag… "7.0−7.9 mag…
## 2 "4.0−5.9 magnitude\n 6.… "7.0−7.9 magnitude\n 8.… <NA> <NA>
## 3 "Strongest magnitude" "8.2 Mw United States" <NA> <NA>
## 4 "Deadliest" "7.2 Mw Haiti 2,248 de… <NA> <NA>
## 5 "Total fatalities" "2,406" <NA> <NA>
## 6 "Number by magnitude" "Number by magnitude" <NA> <NA>
## 7 "9.0+" "0" <NA> <NA>
## 8 "8.0−8.9" "3" <NA> <NA>
## 9 "7.0−7.9" "12" <NA> <NA>
## 10 "6.0−6.9" "98" <NA> <NA>
## 11 "5.0−5.9" "1,524" <NA> <NA>
## 12 "4.0−4.9" "9,450" <NA> <NA>
## 13 "← 2020" "← 2020" <NA> <NA>
##
## [[2]]
## # A tibble: 1 × 2
## X1 X2
## <chr> <chr>
## 1 "4.0−5.9 magnitude\n 6.0−6.9 magnitude" "7.0−7.9 magnitude\n 8.0+ magnitude"
##
## [[3]]
## # A tibble: 6 × 12
## Magnitude `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 8.0–9.9 1 2 2 1 1 0 1 1 1
## 2 7.0–7.9 19 14 17 11 18 16 6 16 9
## 3 6.0–6.9 187 117 123 143 124 127 104 118 135
## 4 5.0–5.9 2,486 1,546 1,460 1,580 1,413 1,550 1,447 1,671 1,484
## 5 4.0–4.9 13,129 10,955 11,877 15,817 13,777 13,700 10,544 12,782 11,897
## 6 Total 15,822 12,635 13,480 17,552 15,336 15,397 13,102 14,589 13,530
## # … with 2 more variables: 2020 <chr>, 2021 <chr>
##
## [[4]]
## # A tibble: 4 × 8
## Rank `Death toll` Magnitude Location MMI `Depth (km)` Date Event
## <int> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1 2,248 7.2 Haiti, Nipp… IX (V… 10 Augus… 2021 Hai…
## 2 2 108 6.2 Indonesia, … VIII … 18 Janua… 2021 Wes…
## 3 3 11 7 Mexico, Gue… VIII … 20 Septe… 2021 Gue…
## 4 4 10 6 Indonesia, … V (Mo… 67 April… 2021 Eas…
##
## [[5]]
## # A tibble: 15 × 8
## Rank Magnitude `Death toll` Location MMI `Depth (km)` Date Event
## <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 8.2 0 United State… VII (… 32.2 July… 2021 Ch…
## 2 2 8.1 0 New Zealand,… VIII … 26.5 Marc… 2021 Ke…
## 3 2 8.1 0 South Georgi… VII (… 55.7 Augu… 2021 So…
## 4 4 7.7 0 New Caledoni… IV (L… 10 Febr… 2021 Lo…
## 5 5 7.5 0 South Georgi… VI (S… 63.3 Augu… 2021 So…
## 6 6 7.4 0 New Zealand,… VII (… 55.6 Marc… 2021 Ke…
## 7 7 7.3 0 China, Qingh… IX (V… 10 May … 2021 Ma…
## 8 7 7.3 0 New Zealand,… VI (S… 10 Marc… -
## 9 9 7.2 2,248 Haiti, Nippes IX (V… 10 Augu… 2021 Ha…
## 10 10 7.1 1 Japan, Miyag… VIII … 49.9 Febr… 2021 Fu…
## 11 10 7.1 1 Philippines,… VII (… 65.6 Augu… 2021 Da…
## 12 10 7.1 0 South Georgi… IV (L… 14 Augu… 2021 So…
## 13 13 7 11 Mexico, Guer… VIII … 12.6 Sept… 2021 Gu…
## 14 13 7 0 Japan, Miyag… VII (… 54 Marc… March 2…
## 15 13 7 0 Philippines … VI (S… 80 Janu… 2021 Ta…
##
## [[6]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 7.0 Mw Philippines
## 2 Deadliest 6.2 Mw Indonesia 108 deaths
## 3 Total fatalities 108
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 0
## 6 7.0−7.9 1
## 7 6.0−6.9 11
## 8 5.0−5.9 136
## 9 4.0−4.9 1,021
##
## [[7]]
## # A tibble: 20 × 8
## Date `Country and loc… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loca… Mw Depth (km) MMI Notes Dead Injured
## 2 3[1] United States, A… 6.1 21.0 VI - - -
## 3 6[2] New Zealand offs… 6.2 37.1 IV - - -
## 4 6[3] Croatia, Sisak-M… 4.7 10.0 VIII It w… - -
## 5 6[5] Indonesia, Goron… 6.1 148.0 IV - - -
## 6 8[6] New Zealand offs… 6.3 224.0 IV - - -
## 7 8[7] Vanuatu, Tafea o… 6.1 113.0 IV - - -
## 8 10[8] Argentina, Jujuy… 6.1 217.0 IV - - -
## 9 10[9] Vanuatu, Malampa… 6.1 160.0 IV - - -
## 10 10[10] Turkey, Ankara, … 4.3 10.0 IV Seve… - -
## 11 11[12] Mongolia, Khövsg… 6.7 10.0 VIII The … - 53
## 12 14[15] Indonesia, West … 5.7 18.0 VII It w… - 1
## 13 14[17] Indonesia, West … 6.2 18.0 VIII The … 108 3,369
## 14 15[20] Iran, Hormozgan,… 5.5 8.0 VII Abou… - 1
## 15 19[22] Argentina, San J… 6.4 16.9 VII Vari… - 14
## 16 21[25] Philippines offs… 7.0 80.0 VI In I… - -
## 17 23[27] Spain, Andalusia… 4.2 10.0 IV It w… - 1
## 18 23[29] Antarctica offsh… 6.9 9.8 V Smal… - -
## 19 28[32] Spain, Andalusia… 4.3 10.0 IV It w… - -
## 20 31[34] Guyana, Upper Ta… 5.5 5.4 VIII It w… - -
##
## [[8]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude "7.7 Mw New Caledonia"
## 2 Deadliest "7.1 Mw Japan1 death5.9 Mw Tajikistan1 death\n4.9 Mw I…
## 3 Total fatalities "3"
## 4 Number by magnitude "Number by magnitude"
## 5 8.0−8.9 "0"
## 6 7.0−7.9 "2"
## 7 6.0−6.9 "14"
## 8 5.0−5.9 "253"
## 9 4.0−4.9 "1,182"
##
## [[9]]
## # A tibble: 24 × 8
## Date `Country and loc… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loca… Mw Depth (km) MMI Notes Dead Injured
## 2 3[38] West Chile Rise 6.7 10.0 IV - - -
## 3 3[39] Indonesia, West … 4.9 10.0 III It w… 1 -
## 4 7[41] Philippines, Dav… 6.1 10.0 VII The … - 14
## 5 7[43] Papua New Guinea… 6.3 10.0 IV - - -
## 6 10[44] New Caledonia of… 6.1 11.0 - It w… - -
## 7 10[45] Indonesia, Bengk… 6.3 10.0 IV - - -
## 8 10[46] New Caledonia of… 6.1 10.0 - It w… - -
## 9 10[47] New Caledonia of… 7.7 10.0 IV The … - -
## 10 10[51] New Caledonia of… 6.1 11.7 - Thes… - -
## # … with 14 more rows
##
## [[10]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 8.1 Mw New Zealand
## 2 Deadliest 5.4 Mw China3 deaths5.1 Mw Colombia3 deaths
## 3 Total fatalities 7
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 1
## 6 7.0−7.9 3
## 7 6.0−6.9 14
## 8 5.0−5.9 340
## 9 4.0−4.9 1,745
##
## [[11]]
## # A tibble: 25 × 8
## Date `Country and loc… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loca… Mw Depth (km) MMI Notes Dead Injured
## 2 1[77] Colombia, Antioq… 5.1 10.0 IX Vario… 3 6
## 3 3[81] Greece, Thessali… 6.3 8.0 VIII The 2… 1 11
## 4 4[85] New Zealand, Gis… 7.3 10.0 VI A tsu… - -
## 5 4[89] Vanuatu, Torba o… 6.1 173.3 IV - - -
## 6 4[90] New Zealand, Ker… 7.4 43.0 VII It wa… - -
## 7 4[92] New Zealand, Ker… 8.1 21.2 VIII The 2… - -
## 8 4[96] New Zealand, Ker… 6.1 10.0 III It wa… - -
## 9 4[97] Greece, Thessali… 5.8 10.0 VIII It wa… - -
## 10 4[99] New Zealand, Ker… 6.5 10.0 III Those… - -
## # … with 15 more rows
##
## [[12]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 6.6 Mw South Sandwich Islands
## 2 Deadliest 6.0 Mw Indonesia10 deaths
## 3 Total fatalities 12
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 0
## 6 7.0−7.9 0
## 7 6.0−6.9 14
## 8 5.0−5.9 131
## 9 4.0−4.9 1,099
##
## [[13]]
## # A tibble: 20 × 8
## Date `Country and lo… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loc… Mw Depth (km) MMI Notes Dead Injured
## 2 1[123] New Zealand, Ke… 6.5 20.0 IV It w… - -
## 3 1[124] Algeria, Guelma… 4.8 10.0 VII Some… - -
## 4 1[126] Tonga offshore,… 6.0 595.0 II - - -
## 5 3[127] South Georgia a… 6.6 10.0 I - - -
## 6 5[128] New Zealand, Gi… 6.1 10.0 V It w… - -
## 7 6[129] Iraq, Sulaymani… 5.2 9.3 VII Some… - 4
## 8 7[131] New Zealand, Ke… 6.1 10.0 III It w… - -
## 9 10[132] Indonesia, East… 6.0 67.0 V Duri… 10 104
## 10 10[135] Philippines off… 6.1 311.3 III - - -
## 11 10[136] Papua New Guine… 6.0 10.0 IV - - -
## 12 18[137] Iran, Bushehr, … 5.8 8.0 VIII Crac… - 5
## 13 18[139] Taiwan, Hualien… 5.8 12.0 VIII Mino… - -
## 14 20[141] Indonesia, Nort… 6.1 9.0 IV - - -
## 15 24[142] Tonga, Ha'apai … 6.5 301.0 IV Thes… - -
## 16 25[143] Tonga, Tongatap… 6.5 246.0 IV Thes… - -
## 17 27[144] Papua New Guine… 6.1 10.0 IV - - -
## 18 27[145] Indonesia, West… 5.0 57.4 IV Two … - -
## 19 28[147] India, Assam, 9… 6.0 34.0 VII The … 2 12
## 20 29[151] New Zealand, Ke… 6.1 10.0 III It w… - -
##
## [[14]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 7.3 Mw China
## 2 Deadliest 6.1 Mw China3 deaths
## 3 Total fatalities 3
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 0
## 6 7.0−7.9 1
## 7 6.0−6.9 12
## 8 5.0−5.9 129
## 9 4.0−4.9 835
##
## [[15]]
## # A tibble: 20 × 8
## Date `Country and lo… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loc… Mw Depth (km) MMI Notes Dead Injured
## 2 1[152] Japan, Miyagi o… 6.9 47.3 VII 3 pe… - 3
## 3 2[154] Indonesia, West… 5.5 21.0 V An e… - -
## 4 5[156] Indonesia, West… 5.7 23.1 V The … - -
## 5 7[157] Australia offsh… 6.0 10.0 - - - -
## 6 7[158] Fiji, Lau offsh… 6.1 378.6 III - - -
## 7 12[159] El Salvador off… 6.0 22.0 IV - - -
## 8 12[160] Mauritius - Reu… 6.7 10.0 III - - -
## 9 13[161] Panama, Chiriqu… 6.0 10.0 IV - - -
## 10 13[162] Japan, Miyagi o… 6.0 32.0 IV It w… - -
## 11 14[164] Indonesia, Nort… 6.7 11.0 IV - - -
## 12 17[165] Iran, North Kho… 5.4 10.0 VII Some… - 25
## 13 18[167] Nepal, Gandaki,… 5.3 10.0 V Arou… - 6
## 14 19[169] Southern East P… 6.7 10.0 I - - -
## 15 21[170] China, Yunnan, … 6.1 9.0 VIII The … 3 32
## 16 21[172] China, Southern… 7.3 10.0 IX Seve… - 19
## 17 21[175] France, Wallis … 6.5 10.0 VII - - -
## 18 21[176] Indonesia, East… 5.7 107.9 IV More… - 3
## 19 25[178] Rwanda, Western… 4.7 10.0 VII Hous… - 3
## 20 31[181] United States, … 6.1 43.9 VI Mino… - -
##
## [[16]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 6.5 Mw New Zealand
## 2 Deadliest 5.0 Mw Democratic Republic of the Congo2 deaths
## 3 Total fatalities 3
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 0
## 6 7.0−7.9 0
## 7 6.0−6.9 2
## 8 5.0−5.9 100
## 9 4.0−4.9 791
##
## [[17]]
## # A tibble: 9 × 8
## Date `Country and loc… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loca… Mw Depth (km) MMI Notes Dead Injured
## 2 3[183] Indonesia offsho… 6.2 9.9 IV - - -
## 3 10[184] Democratic Repub… 5.0 10.0 V Vari… 2 3
## 4 10[187] China, Yunnan, 1… 5.0 10.0 II Two … - 2
## 5 16[189] Indonesia, Maluk… 5.9 5.7 VI Home… - -
## 6 20[193] New Zealand, Ker… 6.5 10.0 IV It w… - -
## 7 23[194] Peru, Cañete off… 5.9 49.5 VII Duri… 1 20
## 8 23[199] Argentina, Mendo… 4.6 10.0 IV Some… - -
## 9 26[201] Turkey, Bingöl, … 5.4 10.0 VI Seve… - 1
##
## [[18]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude 8.2 Mw United States
## 2 Deadliest 5.7 Mw Tajikistan5 deaths
## 3 Total fatalities 6
## 4 Number by magnitude Number by magnitude
## 5 8.0−8.9 1
## 6 7.0−7.9 0
## 7 6.0−6.9 12
## 8 5.0−5.9 135
## 9 4.0−4.9 1123
##
## [[19]]
## # A tibble: 16 × 8
## Date `Country and lo… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loc… Mw Depth (km) MMI Notes Dead Injured
## 2 2[203] Fiji region 6.1 599.6 II - - -
## 3 4[204] Chile, Atacama … 6.0 24.0 VI Thes… - -
## 4 4[205] Chile, Atacama … 6.0 22.0 VI Thes… - -
## 5 8[206] United States, … 6.0 7.5 VII The … - -
## 6 10[208] Indonesia offsh… 6.1 43.6 IV - - -
## 7 10[209] Tajikistan, Dis… 5.7 12.8 VII Five… 5 30
## 8 18[212] Panama, Chiriqu… 6.1 9.1 IV It w… - -
## 9 21[213] Papua New Guine… 6.0 8.7 IV - - -
## 10 21[214] Panama, Chiriqu… 6.7 10.0 VI Powe… - -
## 11 23[216] Philippines, Ca… 6.7 110.0 V Some… - -
## 12 24[219] New Zealand, Ke… 6.1 10.0 III It w… - -
## 13 26[220] Indonesia, Cent… 6.3 11.0 VI Powe… 1 -
## 14 29[224] United States, … 8.2 35.0 VII The … - -
## 15 29[227] Myanmar, Sagain… 5.5 10.0 VII A br… - -
## 16 30[230] Peru, Piura, 9 … 6.2 34.4 VII Mode… - 721
##
## [[20]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude "8.1 Mw South Sandwich Islands"
## 2 Deadliest "7.2 Mw Haiti2,248\ndeaths"
## 3 Total fatalities "2,250"
## 4 Number by magnitude "Number by magnitude"
## 5 8.0−8.9 "1"
## 6 7.0−7.9 "4"
## 7 6.0−6.9 "15"
## 8 5.0−5.9 "296"
## 9 4.0−4.9 "955"
##
## [[21]]
## # A tibble: 21 × 8
## Date `Country and lo… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loc… Mw Depth (km) MMI Notes Dead Injured
## 2 3[233] India offshore,… 6.1 10.0 IV - - -
## 3 11[234] Philippines, Da… 7.1 65.6 VII The … 1 -
## 4 12[236] South Georgia a… 7.5 47.2 VI The … - -
## 5 12[237] South Georgia a… 8.1 55.7 VII The … - -
## 6 12[238] South Georgia a… 6.7 35.0 - Thes… - -
## 7 12[239] South Georgia a… 6.1 35.0 IV Thes… - -
## 8 12[240] Spain, Andalusi… 4.6 10.0 VII It w… - -
## 9 13[242] South Georgia a… 6.1 10.0 IV It w… - -
## 10 14[243] United States, … 6.9 21.0 IV It w… - -
## # … with 11 more rows
##
## [[22]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 Strongest magnitude "7.0 Mw Mexico"
## 2 Deadliest "7.0 Mw Mexico11\ndeaths"
## 3 Total fatalities "14"
## 4 Number by magnitude "Number by magnitude"
## 5 8.0−8.9 "0"
## 6 7.0−7.9 "1"
## 7 6.0−6.9 "5"
## 8 5.0−5.9 "22"
## 9 4.0−4.9 "69"
##
## [[23]]
## # A tibble: 10 × 8
## Date `Country and lo… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Date Country and loc… Mw Depth (km) MMI Notes Dead Injured
## 2 7[260] Tonga, Haʻapai … 6.0 10.0 IV - - -
## 3 8[261] Mexico, Guerrer… 7.0 20.0 VIII The … 11 23
## 4 13[274] Iran, Razavi Kh… 5.1 10.0 V Buil… - 10
## 5 13[276] Argentina, Salt… 6.2 193.4 IV - - -
## 6 15[277] China, Sichuan,… 5.4 10.0 VII Thre… 3 146
## 7 20[281] Russia, Kuril I… 6.0 25.0 IV - - -
## 8 21[282] Chile, Biobío o… 6.4 12.5 V - - -
## 9 21[283] Australia, Vict… 5.9 10.0 VII The … - -
## 10 22[285] Nicaragua, offs… 6.5 30.7 V - - -
##
## [[24]]
## # A tibble: 10 × 2
## `.mw-parser-output .navbar{display:in… `.mw-parser-output .navbar{display:in…
## <chr> <chr>
## 1 "January" "West Sulawesi, Indonesia (6.2, Jan 1…
## 2 "February" "Mindanao, Philippines (6.0, Feb 7)\n…
## 3 "March" "Antioquia, Colombia (5.1, Mar 1)\nLa…
## 4 "April" "East Java, Indonesia (6.0, Apr 10)\n…
## 5 "May" "Yunnan, China (6.1, May 21)\nQinghai…
## 6 "June" "Mala, Peru (5.8, June 23)"
## 7 "July" "Antelope Valley, California (6.0, Ju…
## 8 "August" "Davao Oriental, Philippines (7.1, Au…
## 9 "September" "Guerrero, Mexico (7.0, Sep 7)\nMelbo…
## 10 "† indicates earthquake resulting in … "† indicates earthquake resulting in …
##
## [[25]]
## # A tibble: 4 × 2
## `vteEarthquakes by year` `vteEarthquakes by year`
## <chr> <chr>
## 1 "19th century" "1900"
## 2 "20th century" "1901\n1902\n1903\n1904\n1905\…
## 3 "21st century" "2001\n2002\n2003\n2004\n2005\…
## 4 "Historical earthquakes\nLists of earthquakes" "Historical earthquakes\nLists…
That was a lot of tables! Let’s dig in and see if we can extract the monthly ones. I think the seventh one is January’s.
tables <- html %>%
  html_table(header = TRUE)
tables[[7]][-1, ]  # drop the first row, which repeats the column headers
## # A tibble: 19 × 8
## Date `Country and loc… Mw `Depth (km)` MMI Notes Casualties Casualties
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3[1] United States, A… 6.1 21.0 VI - - -
## 2 6[2] New Zealand offs… 6.2 37.1 IV - - -
## 3 6[3] Croatia, Sisak-M… 4.7 10.0 VIII It w… - -
## 4 6[5] Indonesia, Goron… 6.1 148.0 IV - - -
## 5 8[6] New Zealand offs… 6.3 224.0 IV - - -
## 6 8[7] Vanuatu, Tafea o… 6.1 113.0 IV - - -
## 7 10[8] Argentina, Jujuy… 6.1 217.0 IV - - -
## 8 10[9] Vanuatu, Malampa… 6.1 160.0 IV - - -
## 9 10[10] Turkey, Ankara, … 4.3 10.0 IV Seve… - -
## 10 11[12] Mongolia, Khövsg… 6.7 10.0 VIII The … - 53
## 11 14[15] Indonesia, West … 5.7 18.0 VII It w… - 1
## 12 14[17] Indonesia, West … 6.2 18.0 VIII The … 108 3,369
## 13 15[20] Iran, Hormozgan,… 5.5 8.0 VII Abou… - 1
## 14 19[22] Argentina, San J… 6.4 16.9 VII Vari… - 14
## 15 21[25] Philippines offs… 7.0 80.0 VI In I… - -
## 16 23[27] Spain, Andalusia… 4.2 10.0 IV It w… - 1
## 17 23[29] Antarctica offsh… 6.9 9.8 V Smal… - -
## 18 28[32] Spain, Andalusia… 4.3 10.0 IV It w… - -
## 19 31[34] Guyana, Upper Ta… 5.5 5.4 VIII It w… - -
Let’s now start the final data frame, making certain we add a Month column that records which month each earthquake happened in.
jan <- tables[[7]][-1,]
df <- jan %>% add_column(Month = 'January')
## Warning: The `.data` argument of `add_column()` must have unique names as of tibble 3.0.0.
## Use `.name_repair = "minimal"`.
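Let’s automate the rest. The page currently has monthly tables through September, so there are nine months of data, and the loop below adds February through September to the January rows.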
monthlist <- c('January','February','March','April','May','June','July','August','September')
counter <- 2:9
for (number in counter) {
  # monthly tables sit at list positions 7, 9, ..., 23, i.e. 5 + 2 * month number
  newtable <- tables[[5 + 2 * number]][-1, ]  # the [-1, ] drops the repeated header row
  newtable <- newtable %>% add_column(Month = monthlist[number])
  df <- bind_rows(df, newtable)
}
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
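The index arithmetic in the loop can be checked on its own. Here is a small sketch in base R (no packages needed); `month_names` and `table_index` are names I made up for illustration, and the `5 + 2 * number` rule is taken straight from the loop.

```r
# Map each month to the position of its table in the scraped list.
month_names <- month.name[1:9]                  # "January" ... "September"
table_index <- 5 + 2 * seq_along(month_names)   # same rule as in the loop
names(table_index) <- month_names
print(table_index)                              # January = 7, ..., September = 23
```

If Wikipedia adds or removes a table near the top of the page, the offset of 5 will probably shift, so it is worth looking at tables[[7]] again before rerunning the loop.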
head(df)
## # A tibble: 6 × 9
## Date `Country and locatio… Mw `Depth (km)` MMI Notes Casualties...7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3[1] United States, Alask… 6.1 21.0 VI - -
## 2 6[2] New Zealand offshore… 6.2 37.1 IV - -
## 3 6[3] Croatia, Sisak-Mosla… 4.7 10.0 VIII It was an… -
## 4 6[5] Indonesia, Gorontalo… 6.1 148.0 IV - -
## 5 8[6] New Zealand offshore… 6.3 224.0 IV - -
## 6 8[7] Vanuatu, Tafea offsh… 6.1 113.0 IV - -
## # … with 2 more variables: Casualties...8 <chr>, Month <chr>
I am going to need to do some cleaning now that I have gathered the data. Each entry in the Date column ends with a footnote reference in square brackets. Let’s strip that and store the result in a new column called Day.
df <- df %>%
  mutate(
    Day = str_remove_all(Date, pattern = "\\[*[0-9]*\\]")
  )
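To see what that clean-up does to individual values, here is a minimal base R sketch. The input strings are hypothetical examples in the same shape as the Date column, and the pattern is a slightly stricter variant of my own (it insists on the opening bracket and at least one digit), not the one used in the post.

```r
# Hypothetical Date strings shaped like the scraped column: day + footnote marker.
date_raw <- c("3[1]", "10[10]", "31[34]")

# Remove a bracketed run of digits, then convert what is left to an integer day.
day <- as.integer(gsub("\\[[0-9]+\\]", "", date_raw))
print(day)   # 3 10 31
```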
Next I’d like to change all the casualty entries from NA to zero.
df <- df %>%
  mutate(
    Deaths = str_replace_all(Casualties...7, "\\-", "0"),
    Injuries = str_replace_all(Casualties...8, "\\-", "0")
  )
They were not actually NA but a literal “-”. Tricky!
Let’s continue by extracting some information about the country and location.
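Replacing every hyphen works here because the placeholder is the only hyphen in those columns. A more defensive variant, sketched below in base R with made-up values, only touches cells that consist of a single hyphen; that this is the intent is my assumption.

```r
# Hypothetical casualty cells: a placeholder, a plain count, a count with a comma.
casualties_raw <- c("-", "108", "3,369")

# ^ and $ anchor the match, so only a cell that is exactly "-" becomes "0".
casualties_filled <- sub("^-$", "0", casualties_raw)
print(casualties_filled)   # "0" "108" "3,369"
```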
df <- df %>%
  mutate(
    Offshore = str_detect(`Country and location`, "offshore"),
    # [A-Za-z] rather than [A-z]: the A-z range also spans a few punctuation characters
    Country = str_extract(`Country and location`, '[A-Za-z]+')
  )
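A couple of these are wrong: the pattern grabs only the first word, so multi-word names come out truncated (“United”, “South”, “Papua”). I’ll fix the obvious ones and come back if I spot others I missed.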
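Here is a short base R sketch of why a first-word pattern truncates names, and why it is safer to spell the letter class as A-Z plus a-z. The location strings are hypothetical, and `perl = TRUE` is used only so the character ranges are compared by code point.

```r
# Hypothetical location strings in the style of the scraped column.
location <- c("United States, Alaska offshore",
              "Croatia, Sisak-Moslavina",
              "Papua New Guinea offshore")

# Extract the first run of letters from each string.
first_word <- regmatches(location, regexpr("[A-Za-z]+", location))
print(first_word)   # "United" "Croatia" "Papua"

# A single range from A to z also covers the punctuation between Z and a.
wide_class_hits_underscore  <- grepl("[A-z]", "_", perl = TRUE)      # TRUE
exact_class_hits_underscore <- grepl("[A-Za-z]", "_", perl = TRUE)   # FALSE
```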
df <- df %>%
  mutate(
    Country = replace(Country, Country == 'United', 'United States'),
    Country = replace(Country, Country == 'South', 'South Georgia'),
    Country = replace(Country, Country == 'Papua', 'Papua New Guinea')
  )
df <- df %>%
  mutate(
    Country = replace(Country, str_detect(`Country and location`, 'New Zealand'), 'New Zealand'),
    Country = replace(Country, str_detect(`Country and location`, 'New Cal'), 'New Caledonia')
  )
df <- df %>%
  rename(Depth = 'Depth (km)')
Okay, I think I have extracted all the info I am going to. I’ll clean up the dataset and organize it.
df <- df %>%
  select(c(Month, Day, Country, Deaths, Injuries, Mw, Depth, MMI, Offshore))
df <- df %>%
  mutate(
    Deaths = str_remove_all(Deaths, ","),
    Injuries = str_remove_all(Injuries, ",")
  )
df <- df %>%
  mutate(Deaths = as.integer(Deaths),
         Injuries = as.integer(Injuries),
         Depth = as.numeric(Depth),
         Day = as.integer(Day),
         Mw = as.numeric(Mw))
df <- df %>%
  drop_na()
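One thing to keep in mind with the conversion steps above: anything that cannot be parsed as a number becomes NA, and drop_na() then removes that whole row without saying so. A minimal base R sketch with hypothetical values (the "unknown" entry is invented to trigger the effect):

```r
# Hypothetical Injuries cells after the "-" placeholders were replaced.
injuries_raw <- c("0", "3,369", "53", "unknown")

# Strip thousands separators, then convert; unparseable text turns into NA.
injuries <- suppressWarnings(as.integer(gsub(",", "", injuries_raw)))
print(injuries)            # 0 3369 53 NA

# Counting the NAs first shows how many rows a later drop_na() would discard.
n_dropped <- sum(is.na(injuries))
print(n_dropped)           # 1
```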
df %>%
  summarize(Average_Deaths = mean(Deaths),
            Average_Injuries = mean(Injuries))
## # A tibble: 1 × 2
## Average_Deaths Average_Injuries
## <dbl> <dbl>
## 1 15.4 114.
df %>% summarize_if(is.numeric, c(Mean = mean,Median = median))
## # A tibble: 1 × 10
## Day_Mean Deaths_Mean Injuries_Mean Mw_Mean Depth_Mean Day_Median Deaths_Median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14.0 15.4 114. 6.08 41.9 13 0
## # … with 3 more variables: Injuries_Median <dbl>, Mw_Median <dbl>,
## # Depth_Median <dbl>
library(corrr)
df_cor <- df %>%
  select(c(Day, Injuries, Deaths, Mw, Depth)) %>%
  correlate()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
df_cor
## # A tibble: 5 × 6
## term Day Injuries Deaths Mw Depth
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Day NA 0.00863 -0.000666 -0.0727 -0.150
## 2 Injuries 0.00863 NA 0.976 0.127 -0.0357
## 3 Deaths -0.000666 0.976 NA 0.128 -0.0309
## 4 Mw -0.0727 0.127 0.128 NA 0.0882
## 5 Depth -0.150 -0.0357 -0.0309 0.0882 NA
Here we see that Injuries and Deaths are very strongly correlated (r ≈ 0.98), while every other pair is only weakly correlated (|r| ≤ 0.15).
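A caveat worth keeping in mind with Pearson correlations on heavily skewed counts: a single very large event can produce a correlation near 1 on its own. The sketch below uses made-up numbers, not the earthquake data, and I have not checked whether this is what drives the value in the table above.

```r
# Made-up counts: four small events and one catastrophic one.
deaths   <- c(0, 0, 1, 0, 2000)
injuries <- c(0, 1, 0, 3, 12000)

r_all  <- cor(deaths, injuries)            # all five events
r_rest <- cor(deaths[-5], injuries[-5])    # the large event left out
print(c(all_events = r_all, without_largest = r_rest))
```

With the large event included the correlation is close to 1; without it, it is negative. A rank-based measure such as cor(..., method = "spearman") is less sensitive to this.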
stretch(df_cor) %>%
  ggplot(aes(x = x, y = y, fill = r, label = round(r, 2))) +
  geom_tile()
df %>%
  group_by(MMI) %>%
  summarize_if(is.numeric, mean)
## # A tibble: 10 × 6
## MMI Day Deaths Injuries Mw Depth
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 - 10.9 0 0 6.15 14.3
## 2 I 11 0 0 6.65 10
## 3 II 8.25 0 0.5 5.8 304.
## 4 III 10.8 0.0833 0 6.1 66.0
## 5 IV 14.4 0.0189 0.170 6.09 51.6
## 6 IX 15.5 563. 3198. 6.15 10
## 7 V 14.4 0.8 11.5 5.87 24.6
## 8 VI 13.7 0.0625 0.25 6.12 22.1
## 9 VII 16.6 0.552 38.8 6.05 24.4
## 10 VIII 13.5 9.54 283. 6.21 14.3
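In the table above the MMI rows are sorted alphabetically, which is why IX lands between IV and V. If the intensity order matters, one option is an ordered factor. This is a base R sketch with hypothetical values; `mmi_levels` is a name introduced here, and treating “-” as missing is my assumption.

```r
# Hypothetical MMI values, including the "-" placeholder used for missing intensity.
mmi <- c("IV", "IX", "V", "VIII", "-", "I")

# The Modified Mercalli scale runs from I to XII.
mmi_levels <- c("I", "II", "III", "IV", "V", "VI",
                "VII", "VIII", "IX", "X", "XI", "XII")

# Values not listed in the levels (here "-") become NA.
mmi_ordered <- factor(mmi, levels = mmi_levels, ordered = TRUE)
sorted_mmi <- as.character(sort(mmi_ordered))   # sort() drops the NA
print(sorted_mmi)   # "I" "IV" "V" "VIII" "IX"
```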
summary(df$Day)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 13.00 14.03 21.00 31.00
ggplot(data = df, aes(y = Day, color = Month)) +
  geom_boxplot()
ggplot(df, aes(x = Day)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df, aes(x = Injuries)) +
  geom_histogram(bins = 100)
ggplot(df, aes(x = MMI)) +
  geom_bar(aes(fill = Month), position = 'fill')
ggplot(df, aes(x = Injuries, y = Deaths)) +
  geom_jitter(aes(color = MMI))
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
df1 <- df %>%
  select(-Country, -Month, -Day)
ggpairs(df1, aes(color = Offshore))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ggmosaic)
##
## Attaching package: 'ggmosaic'
## The following object is masked from 'package:GGally':
##
## happy
ggplot(data = df) +
  geom_mosaic(aes(x = product(MMI, Month), fill = Offshore), na.rm = TRUE)
ggplot(df, aes(sample = Injuries)) +
  geom_qq() +
  geom_qq_line()
df %>%
  count(Offshore, MMI) %>%
  spread(MMI, n, fill = 0)
## # A tibble: 2 × 11
## Offshore `-` I II III IV IX V VI VII VIII
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 0 1 2 2 10 4 7 5 18 11
## 2 TRUE 8 1 2 10 43 0 8 11 11 2
df %>%
  group_by(Offshore) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency / sum(Frequency))
## # A tibble: 2 × 3
## Offshore Frequency Proportion
## <lgl> <int> <dbl>
## 1 FALSE 60 0.385
## 2 TRUE 96 0.615
Moving on to decision trees and classification. The variable you are predicting must be a factor!
df <- df %>% mutate(
  Offshore = factor(Offshore == TRUE, levels = c(TRUE, FALSE),
                    labels = c('offshore', 'on land'))
)
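Here is what that factor() call does on a tiny logical vector, in base R. The input values are invented; the levels and labels are the ones used above.

```r
# Hypothetical logical flags.
offshore_flag <- c(TRUE, FALSE, TRUE)

# levels fixes the order of the categories; labels renames them in that same order.
offshore_f <- factor(offshore_flag, levels = c(TRUE, FALSE),
                     labels = c("offshore", "on land"))
print(offshore_f)
print(levels(offshore_f))   # "offshore" "on land"
```

The level order matters later on: the class probabilities printed by rpart are listed in the order of the factor levels, so “offshore” comes first.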
library(rpart)
library(rpart.plot)
tree <- rpart(Offshore ~.,data = df)
tree
## n= 156
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 156 60 offshore (0.61538462 0.38461538)
## 2) Country=Algeria,Antarctica,Australia,Chile,El,Fiji,France,India,Indonesia,Japan,New Caledonia,New Zealand,Nicaragua,Panama,Papua New Guinea,Philippines,Russia,South Georgia,Tonga,United States,Vanuatu 110 15 offshore (0.86363636 0.13636364)
## 4) Mw>=5.95 99 6 offshore (0.93939394 0.06060606) *
## 5) Mw< 5.95 11 2 on land (0.18181818 0.81818182) *
## 3) Country=Argentina,Armenia,China,Colombia,Croatia,Democratic,Greece,Guyana,Haiti,Iceland,Iran,Iraq,Mauritius,Mexico,Mongolia,Myanmar,Nepal,Peru,Rwanda,Southern,Spain,Taiwan,Tajikistan,Tanzania,Turkey,West 46 1 on land (0.02173913 0.97826087) *
rpart.plot(tree, extra = 2)
To make a prediction with the tree we have created, we pass predict() the fitted tree and the dataset we want predictions for.
pred <- predict(tree, df, type = "class")
head(pred)
## 1 2 3 4 5 6
## offshore offshore on land offshore offshore offshore
## Levels: offshore on land
Each earthquake has been classified into its category. You can also recover the class probabilities by dropping type = "class".
predict(tree, df) %>%
  head()
## offshore on land
## 1 0.93939394 0.06060606
## 2 0.93939394 0.06060606
## 3 0.02173913 0.97826087
## 4 0.93939394 0.06060606
## 5 0.93939394 0.06060606
## 6 0.93939394 0.06060606
We see that the first earthquake has about a 94% chance of being offshore according to the tree.
The confusion table compares the true labels with the predicted classes.
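Those probabilities are simply the class shares in the leaf an observation falls into. Reading n and loss off the printed tree (node 4: n = 99, loss = 6; node 3: n = 46, loss = 1), the arithmetic is:

```r
# Share of the majority class in each leaf = (n - loss) / n.
p_node4 <- (99 - 6) / 99   # "offshore" share in node 4
p_node3 <- (46 - 1) / 46   # "on land" share in node 3
print(round(c(node4 = p_node4, node3 = p_node3), 4))   # 0.9394 0.9783
```

These match the 0.93939394 and 0.97826087 in the predict() output.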
confusion_table <- with(df, table(Offshore, pred))
confusion_table
## pred
## Offshore offshore on land
## offshore 93 3
## on land 6 54
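The overall accuracy can be read off the confusion table as the diagonal divided by the total. A base R sketch, with the counts typed in from the table above (`confusion` is just a name I chose):

```r
# Rows are the actual class, columns the predicted class.
confusion <- matrix(c(93, 6, 3, 54), nrow = 2,
                    dimnames = list(actual = c("offshore", "on land"),
                                    predicted = c("offshore", "on land")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(round(accuracy, 3))   # 0.942
```

Keep in mind this is accuracy on the same data the tree was trained on, so it is an optimistic number.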
I will now examine what happens if I withhold some of the data and validate on it. I put roughly two thirds of the data into a training set and keep the remaining third for testing.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
inTrain <- createDataPartition(y = df$Month, p = .66, list = FALSE)
df_train <- df %>% slice(inTrain)
df_test <- df %>% slice(-inTrain)
dim(df_train)
## [1] 107 9
dim(df_test)
## [1] 49 9
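For comparison, here is a plain random split in base R without caret. The seed is arbitrary and the row count of 156 is taken from the data above. Unlike createDataPartition(), which samples within the groups of y (Month here), this version ignores the months entirely, so the set sizes differ a little from the 107 and 49 shown.

```r
set.seed(2021)                        # arbitrary seed, only for reproducibility
n_rows    <- 156
train_idx <- sample(n_rows, size = round(0.66 * n_rows))   # row numbers for training
test_idx  <- setdiff(seq_len(n_rows), train_idx)           # everything else
print(c(train = length(train_idx), test = length(test_idx)))   # 103 and 53
```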
I will use the training set to build my model and then test it. If you look at my original tree, the country is very important. I am going to remove it and not allow it to be part of the decision tree (it was causing me no end of trouble when creating the tree!).
tree_from_train <- rpart(Offshore ~.,data = subset(df_train, select=c( -Country)))
pred_test <- predict(tree_from_train, subset(df_test, select=c( -Country)), type = "class")
with(df_test, table(Offshore, pred_test))
## pred_test
## Offshore offshore on land
## offshore 22 3
## on land 3 21
Pretty decent job predicting on the withheld data!
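Coming back to the Country column: one plausible reason a column with many distinct values is awkward in a train/test split is that the test set can contain values the training set never saw. Whether that was the actual problem here is my assumption. A quick base R check, with invented country vectors:

```r
# Hypothetical Country columns from a training and a test set.
train_country <- c("Japan", "Chile", "Iran", "Japan")
test_country  <- c("Japan", "Haiti", "Chile")

# Values that appear in the test set but not in the training set.
unseen <- setdiff(unique(test_country), unique(train_country))
print(unseen)   # "Haiti"
```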
I’ll create a full tree next! I only have ~150 data points, so I wouldn’t need to chop the data down, but if you have lots of data, please do! I left in the code that does the chopping: sample_n() draws n random rows from the data.
df_no_Country <- subset(df, select = c(-Country))
tree_full <- sample_n(df_no_Country, 100) %>%  # keeps only 100 randomly chosen rows
  rpart(Offshore ~ ., data = ., control = rpart.control(minsplit = 2, cp = 0))
rpart.plot(tree_full, extra = 2, roundint = FALSE,
           box.palette = list("Gn", "Bu"))  # specify 2 colors
Holy cow that looks difficult to interpret!
I see now that I was supposed to withhold some data to test with. I did not keep track of which rows were sampled, so I can’t test on just the held-out rows, but I can do predictions on all the data. Note that the 100 rows above are perfectly classified, so the 56 that are left are the only ones that could be mis-classified.
pred_full <- predict(tree_full, df_no_Country, type = "class")
with(df, table(Offshore, pred_full))
## pred_full
## Offshore offshore on land
## offshore 92 4
## on land 5 55
Still not terrible, but 9 are mis-classified, whereas on the training data there are no mis-classifications. In any case, you should see some over-fitting here: a fully grown tree has low bias but high variance, so it fits the training data perfectly and does worse on data it has not seen.
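The gap is easier to see as error rates. This sketch assumes, as stated above, that all 100 sampled rows are classified correctly, so every error belongs to the remaining rows; the counts come from the confusion table.

```r
n_total   <- 156
n_train   <- 100
n_heldout <- n_total - n_train   # rows the full tree never saw
errors    <- 4 + 5               # off-diagonal cells of the confusion table

error_rates <- c(overall = errors / n_total, heldout = errors / n_heldout)
print(round(error_rates, 3))     # overall 0.058, heldout 0.161
```

So the error rate on unseen rows is roughly 16%, compared with 0% on the rows the tree was grown on.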
imp <- varImp(tree)
head(imp)
## Overall
## Country 54.165273
## Deaths 6.510731
## Injuries 23.322836
## MMI 19.314721
## Month 2.073474
## Mw 48.863238
imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
  geom_bar()
I am not satisfied with this approach, as I have no idea what varImp() does, so I’ll repeat this using a chi-squared-based importance score from the FSelector package.
library(FSelector)
weights <- df %>% chi.squared(Offshore ~ ., data = .) %>%
  as_tibble(rownames = "feature") %>%
  arrange(desc(attr_importance))
weights
## # A tibble: 8 × 2
## feature attr_importance
## <chr> <dbl>
## 1 Country 0.865
## 2 Mw 0.713
## 3 MMI 0.519
## 4 Injuries 0.517
## 5 Month 0.298
## 6 Day 0
## 7 Deaths 0
## 8 Depth 0
ggplot(weights,
       aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")
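To get a feel for what this score measures, the MMI value can be reproduced by hand from the Offshore by MMI count table shown earlier. The sketch below is base R only and computes Cramér’s V, which is the chi-squared statistic rescaled to lie between 0 and 1; that the FSelector score is this quantity is my reading of the numbers, since the result agrees with the 0.519 reported for MMI.

```r
# Offshore x MMI counts, typed in from the count() table earlier in the post.
counts <- matrix(
  c(0, 1, 2,  2, 10, 4, 7,  5, 18, 11,    # Offshore == FALSE
    8, 1, 2, 10, 43, 0, 8, 11, 11,  2),   # Offshore == TRUE
  nrow = 2, byrow = TRUE,
  dimnames = list(Offshore = c("FALSE", "TRUE"),
                  MMI = c("-", "I", "II", "III", "IV", "IX", "V", "VI", "VII", "VIII"))
)

# Pearson chi-squared statistic (small expected counts trigger a warning, silenced here).
chi2 <- unname(suppressWarnings(chisq.test(counts, correct = FALSE))$statistic)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
cramers_v <- sqrt(chi2 / (sum(counts) * (min(dim(counts)) - 1)))
print(round(cramers_v, 3))   # 0.519
```

A value of 0 means the feature and Offshore look independent in the table, and values near 1 mean the feature almost determines the class.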
Another tree because I was playing around…
tree1 <- rpart(MMI ~Offshore + Deaths + Mw,data = df, method = 'class')
tree1
## n= 156
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 156 103 IV (0.051 0.013 0.026 0.077 0.34 0.026 0.096 0.1 0.19 0.083)
## 2) Offshore=offshore 96 53 IV (0.083 0.01 0.021 0.1 0.45 0 0.083 0.11 0.11 0.021)
## 4) Mw< 6.8 79 40 IV (0.1 0.013 0.025 0.13 0.49 0 0.089 0.1 0.051 0) *
## 5) Mw>=6.8 17 10 VII (0 0 0 0 0.24 0 0.059 0.18 0.41 0.12) *
## 3) Offshore=on land 60 42 VII (0 0.017 0.033 0.033 0.17 0.067 0.12 0.083 0.3 0.18)
## 6) Mw< 5.35 21 16 IV (0 0 0.048 0.048 0.24 0.095 0.24 0.095 0.19 0.048)
## 12) Mw>=4.75 13 8 V (0 0 0.077 0.077 0.077 0.15 0.38 0.077 0.15 0) *
## 13) Mw< 4.75 8 4 IV (0 0 0 0 0.5 0 0 0.12 0.25 0.12) *
## 7) Mw>=5.35 39 25 VII (0 0.026 0.026 0.026 0.13 0.051 0.051 0.077 0.36 0.26)
## 14) Mw>=6.55 7 5 IX (0 0.14 0 0.14 0.14 0.29 0 0 0 0.29) *
## 15) Mw< 6.55 32 18 VII (0 0 0.031 0 0.13 0 0.063 0.094 0.44 0.25) *
rpart.plot(tree1, extra = 2)
If you want to implement C4.5 or C5.0, check out the examples here