Scraping Wikipedia

First, I am going to gather the data from Wikipedia with the rvest library.

library(rvest)
html <- read_html(url)

Let’s take a look at the html and see what we have gathered.

html
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

Looks like we got something. Let’s see if we can extract the tables from the webpage.

html %>%
  html_table()
## [[1]]
## # A tibble: 13 × 4
##    X1                       X2                       X3            X4           
##    <chr>                    <chr>                    <chr>         <chr>        
##  1 ".mw-parser-output .loc… ".mw-parser-output .loc… "4.0−5.9 mag… "7.0−7.9 mag…
##  2 "4.0−5.9 magnitude\n 6.… "7.0−7.9 magnitude\n 8.…  <NA>          <NA>        
##  3 "Strongest magnitude"    "8.2 Mw United States"    <NA>          <NA>        
##  4 "Deadliest"              "7.2 Mw  Haiti 2,248 de…  <NA>          <NA>        
##  5 "Total fatalities"       "2,406"                   <NA>          <NA>        
##  6 "Number by magnitude"    "Number by magnitude"     <NA>          <NA>        
##  7 "9.0+"                   "0"                       <NA>          <NA>        
##  8 "8.0−8.9"                "3"                       <NA>          <NA>        
##  9 "7.0−7.9"                "12"                      <NA>          <NA>        
## 10 "6.0−6.9"                "98"                      <NA>          <NA>        
## 11 "5.0−5.9"                "1,524"                   <NA>          <NA>        
## 12 "4.0−4.9"                "9,450"                   <NA>          <NA>        
## 13 "← 2020"                 "← 2020"                  <NA>          <NA>        
## 
## [[2]]
## # A tibble: 1 × 2
##   X1                                      X2                                  
##   <chr>                                   <chr>                               
## 1 "4.0−5.9 magnitude\n 6.0−6.9 magnitude" "7.0−7.9 magnitude\n 8.0+ magnitude"
## 
## [[3]]
## # A tibble: 6 × 12
##   Magnitude `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
##   <chr>     <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1 8.0–9.9   1      2      2      1      1      0      1      1      1     
## 2 7.0–7.9   19     14     17     11     18     16     6      16     9     
## 3 6.0–6.9   187    117    123    143    124    127    104    118    135   
## 4 5.0–5.9   2,486  1,546  1,460  1,580  1,413  1,550  1,447  1,671  1,484 
## 5 4.0–4.9   13,129 10,955 11,877 15,817 13,777 13,700 10,544 12,782 11,897
## 6 Total     15,822 12,635 13,480 17,552 15,336 15,397 13,102 14,589 13,530
## # … with 2 more variables: 2020 <chr>, 2021 <chr>
## 
## [[4]]
## # A tibble: 4 × 8
##    Rank `Death toll` Magnitude Location     MMI    `Depth (km)` Date   Event    
##   <int> <chr>            <dbl> <chr>        <chr>         <dbl> <chr>  <chr>    
## 1     1 2,248              7.2 Haiti, Nipp… IX (V…           10 Augus… 2021 Hai…
## 2     2 108                6.2 Indonesia, … VIII …           18 Janua… 2021 Wes…
## 3     3 11                 7   Mexico, Gue… VIII …           20 Septe… 2021 Gue…
## 4     4 10                 6   Indonesia, … V (Mo…           67 April… 2021 Eas…
## 
## [[5]]
## # A tibble: 15 × 8
##     Rank Magnitude `Death toll` Location      MMI    `Depth (km)` Date  Event   
##    <int>     <dbl> <chr>        <chr>         <chr>         <dbl> <chr> <chr>   
##  1     1       8.2 0            United State… VII (…         32.2 July… 2021 Ch…
##  2     2       8.1 0            New Zealand,… VIII …         26.5 Marc… 2021 Ke…
##  3     2       8.1 0            South Georgi… VII (…         55.7 Augu… 2021 So…
##  4     4       7.7 0            New Caledoni… IV (L…         10   Febr… 2021 Lo…
##  5     5       7.5 0            South Georgi… VI (S…         63.3 Augu… 2021 So…
##  6     6       7.4 0            New Zealand,… VII (…         55.6 Marc… 2021 Ke…
##  7     7       7.3 0            China, Qingh… IX (V…         10   May … 2021 Ma…
##  8     7       7.3 0            New Zealand,… VI (S…         10   Marc… -       
##  9     9       7.2 2,248        Haiti, Nippes IX (V…         10   Augu… 2021 Ha…
## 10    10       7.1 1            Japan, Miyag… VIII …         49.9 Febr… 2021 Fu…
## 11    10       7.1 1            Philippines,… VII (…         65.6 Augu… 2021 Da…
## 12    10       7.1 0            South Georgi… IV (L…         14   Augu… 2021 So…
## 13    13       7   11           Mexico, Guer… VIII …         12.6 Sept… 2021 Gu…
## 14    13       7   0            Japan, Miyag… VII (…         54   Marc… March 2…
## 15    13       7   0            Philippines … VI (S…         80   Janu… 2021 Ta…
## 
## [[6]]
## # A tibble: 9 × 2
##   X1                  X2                          
##   <chr>               <chr>                       
## 1 Strongest magnitude 7.0 Mw  Philippines         
## 2 Deadliest           6.2 Mw  Indonesia 108 deaths
## 3 Total fatalities    108                         
## 4 Number by magnitude Number by magnitude         
## 5 8.0−8.9             0                           
## 6 7.0−7.9             1                           
## 7 6.0−6.9             11                          
## 8 5.0−5.9             136                         
## 9 4.0−4.9             1,021                       
## 
## [[7]]
## # A tibble: 20 × 8
##    Date   `Country and loc… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>  <chr>             <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date   Country and loca… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 3[1]   United States, A… 6.1   21.0         VI    -     -          -         
##  3 6[2]   New Zealand offs… 6.2   37.1         IV    -     -          -         
##  4 6[3]   Croatia, Sisak-M… 4.7   10.0         VIII  It w… -          -         
##  5 6[5]   Indonesia, Goron… 6.1   148.0        IV    -     -          -         
##  6 8[6]   New Zealand offs… 6.3   224.0        IV    -     -          -         
##  7 8[7]   Vanuatu, Tafea o… 6.1   113.0        IV    -     -          -         
##  8 10[8]  Argentina, Jujuy… 6.1   217.0        IV    -     -          -         
##  9 10[9]  Vanuatu, Malampa… 6.1   160.0        IV    -     -          -         
## 10 10[10] Turkey, Ankara, … 4.3   10.0         IV    Seve… -          -         
## 11 11[12] Mongolia, Khövsg… 6.7   10.0         VIII  The … -          53        
## 12 14[15] Indonesia, West … 5.7   18.0         VII   It w… -          1         
## 13 14[17] Indonesia, West … 6.2   18.0         VIII  The … 108        3,369     
## 14 15[20] Iran, Hormozgan,… 5.5   8.0          VII   Abou… -          1         
## 15 19[22] Argentina, San J… 6.4   16.9         VII   Vari… -          14        
## 16 21[25] Philippines offs… 7.0   80.0         VI    In I… -          -         
## 17 23[27] Spain, Andalusia… 4.2   10.0         IV    It w… -          1         
## 18 23[29] Antarctica offsh… 6.9   9.8          V     Smal… -          -         
## 19 28[32] Spain, Andalusia… 4.3   10.0         IV    It w… -          -         
## 20 31[34] Guyana, Upper Ta… 5.5   5.4          VIII  It w… -          -         
## 
## [[8]]
## # A tibble: 9 × 2
##   X1                  X2                                                        
##   <chr>               <chr>                                                     
## 1 Strongest magnitude "7.7 Mw  New Caledonia"                                   
## 2 Deadliest           "7.1 Mw  Japan1 death5.9 Mw  Tajikistan1 death\n4.9 Mw  I…
## 3 Total fatalities    "3"                                                       
## 4 Number by magnitude "Number by magnitude"                                     
## 5 8.0−8.9             "0"                                                       
## 6 7.0−7.9             "2"                                                       
## 7 6.0−6.9             "14"                                                      
## 8 5.0−5.9             "253"                                                     
## 9 4.0−4.9             "1,182"                                                   
## 
## [[9]]
## # A tibble: 24 × 8
##    Date   `Country and loc… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>  <chr>             <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date   Country and loca… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 3[38]  West Chile Rise   6.7   10.0         IV    -     -          -         
##  3 3[39]  Indonesia, West … 4.9   10.0         III   It w… 1          -         
##  4 7[41]  Philippines, Dav… 6.1   10.0         VII   The … -          14        
##  5 7[43]  Papua New Guinea… 6.3   10.0         IV    -     -          -         
##  6 10[44] New Caledonia of… 6.1   11.0         -     It w… -          -         
##  7 10[45] Indonesia, Bengk… 6.3   10.0         IV    -     -          -         
##  8 10[46] New Caledonia of… 6.1   10.0         -     It w… -          -         
##  9 10[47] New Caledonia of… 7.7   10.0         IV    The … -          -         
## 10 10[51] New Caledonia of… 6.1   11.7         -     Thes… -          -         
## # … with 14 more rows
## 
## [[10]]
## # A tibble: 9 × 2
##   X1                  X2                                           
##   <chr>               <chr>                                        
## 1 Strongest magnitude 8.1 Mw   New Zealand                         
## 2 Deadliest           5.4 Mw  China3 deaths5.1 Mw  Colombia3 deaths
## 3 Total fatalities    7                                            
## 4 Number by magnitude Number by magnitude                          
## 5 8.0−8.9             1                                            
## 6 7.0−7.9             3                                            
## 7 6.0−6.9             14                                           
## 8 5.0−5.9             340                                          
## 9 4.0−4.9             1,745                                        
## 
## [[11]]
## # A tibble: 25 × 8
##    Date  `Country and loc… Mw    `Depth (km)` MMI   Notes  Casualties Casualties
##    <chr> <chr>             <chr> <chr>        <chr> <chr>  <chr>      <chr>     
##  1 Date  Country and loca… Mw    Depth (km)   MMI   Notes  Dead       Injured   
##  2 1[77] Colombia, Antioq… 5.1   10.0         IX    Vario… 3          6         
##  3 3[81] Greece, Thessali… 6.3   8.0          VIII  The 2… 1          11        
##  4 4[85] New Zealand, Gis… 7.3   10.0         VI    A tsu… -          -         
##  5 4[89] Vanuatu, Torba o… 6.1   173.3        IV    -      -          -         
##  6 4[90] New Zealand, Ker… 7.4   43.0         VII   It wa… -          -         
##  7 4[92] New Zealand, Ker… 8.1   21.2         VIII  The 2… -          -         
##  8 4[96] New Zealand, Ker… 6.1   10.0         III   It wa… -          -         
##  9 4[97] Greece, Thessali… 5.8   10.0         VIII  It wa… -          -         
## 10 4[99] New Zealand, Ker… 6.5   10.0         III   Those… -          -         
## # … with 15 more rows
## 
## [[12]]
## # A tibble: 9 × 2
##   X1                  X2                            
##   <chr>               <chr>                         
## 1 Strongest magnitude 6.6 Mw  South Sandwich Islands
## 2 Deadliest           6.0 Mw  Indonesia10 deaths    
## 3 Total fatalities    12                            
## 4 Number by magnitude Number by magnitude           
## 5 8.0−8.9             0                             
## 6 7.0−7.9             0                             
## 7 6.0−6.9             14                            
## 8 5.0−5.9             131                           
## 9 4.0−4.9             1,099                         
## 
## [[13]]
## # A tibble: 20 × 8
##    Date    `Country and lo… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>   <chr>            <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date    Country and loc… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 1[123]  New Zealand, Ke… 6.5   20.0         IV    It w… -          -         
##  3 1[124]  Algeria, Guelma… 4.8   10.0         VII   Some… -          -         
##  4 1[126]  Tonga offshore,… 6.0   595.0        II    -     -          -         
##  5 3[127]  South Georgia a… 6.6   10.0         I     -     -          -         
##  6 5[128]  New Zealand, Gi… 6.1   10.0         V     It w… -          -         
##  7 6[129]  Iraq, Sulaymani… 5.2   9.3          VII   Some… -          4         
##  8 7[131]  New Zealand, Ke… 6.1   10.0         III   It w… -          -         
##  9 10[132] Indonesia, East… 6.0   67.0         V     Duri… 10         104       
## 10 10[135] Philippines off… 6.1   311.3        III   -     -          -         
## 11 10[136] Papua New Guine… 6.0   10.0         IV    -     -          -         
## 12 18[137] Iran, Bushehr, … 5.8   8.0          VIII  Crac… -          5         
## 13 18[139] Taiwan, Hualien… 5.8   12.0         VIII  Mino… -          -         
## 14 20[141] Indonesia, Nort… 6.1   9.0          IV    -     -          -         
## 15 24[142] Tonga, Ha'apai … 6.5   301.0        IV    Thes… -          -         
## 16 25[143] Tonga, Tongatap… 6.5   246.0        IV    Thes… -          -         
## 17 27[144] Papua New Guine… 6.1   10.0         IV    -     -          -         
## 18 27[145] Indonesia, West… 5.0   57.4         IV    Two … -          -         
## 19 28[147] India, Assam, 9… 6.0   34.0         VII   The … 2          12        
## 20 29[151] New Zealand, Ke… 6.1   10.0         III   It w… -          -         
## 
## [[14]]
## # A tibble: 9 × 2
##   X1                  X2                   
##   <chr>               <chr>                
## 1 Strongest magnitude 7.3 Mw  China        
## 2 Deadliest           6.1 Mw  China3 deaths
## 3 Total fatalities    3                    
## 4 Number by magnitude Number by magnitude  
## 5 8.0−8.9             0                    
## 6 7.0−7.9             1                    
## 7 6.0−6.9             12                   
## 8 5.0−5.9             129                  
## 9 4.0−4.9             835                  
## 
## [[15]]
## # A tibble: 20 × 8
##    Date    `Country and lo… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>   <chr>            <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date    Country and loc… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 1[152]  Japan, Miyagi o… 6.9   47.3         VII   3 pe… -          3         
##  3 2[154]  Indonesia, West… 5.5   21.0         V     An e… -          -         
##  4 5[156]  Indonesia, West… 5.7   23.1         V     The … -          -         
##  5 7[157]  Australia offsh… 6.0   10.0         -     -     -          -         
##  6 7[158]  Fiji, Lau offsh… 6.1   378.6        III   -     -          -         
##  7 12[159] El Salvador off… 6.0   22.0         IV    -     -          -         
##  8 12[160] Mauritius - Reu… 6.7   10.0         III   -     -          -         
##  9 13[161] Panama, Chiriqu… 6.0   10.0         IV    -     -          -         
## 10 13[162] Japan, Miyagi o… 6.0   32.0         IV    It w… -          -         
## 11 14[164] Indonesia, Nort… 6.7   11.0         IV    -     -          -         
## 12 17[165] Iran, North Kho… 5.4   10.0         VII   Some… -          25        
## 13 18[167] Nepal, Gandaki,… 5.3   10.0         V     Arou… -          6         
## 14 19[169] Southern East P… 6.7   10.0         I     -     -          -         
## 15 21[170] China, Yunnan, … 6.1   9.0          VIII  The … 3          32        
## 16 21[172] China, Southern… 7.3   10.0         IX    Seve… -          19        
## 17 21[175] France, Wallis … 6.5   10.0         VII   -     -          -         
## 18 21[176] Indonesia, East… 5.7   107.9        IV    More… -          3         
## 19 25[178] Rwanda, Western… 4.7   10.0         VII   Hous… -          3         
## 20 31[181] United States, … 6.1   43.9         VI    Mino… -          -         
## 
## [[16]]
## # A tibble: 9 × 2
##   X1                  X2                                              
##   <chr>               <chr>                                           
## 1 Strongest magnitude 6.5 Mw  New Zealand                             
## 2 Deadliest           5.0 Mw  Democratic Republic of the Congo2 deaths
## 3 Total fatalities    3                                               
## 4 Number by magnitude Number by magnitude                             
## 5 8.0−8.9             0                                               
## 6 7.0−7.9             0                                               
## 7 6.0−6.9             2                                               
## 8 5.0−5.9             100                                             
## 9 4.0−4.9             791                                             
## 
## [[17]]
## # A tibble: 9 × 8
##   Date    `Country and loc… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##   <chr>   <chr>             <chr> <chr>        <chr> <chr> <chr>      <chr>     
## 1 Date    Country and loca… Mw    Depth (km)   MMI   Notes Dead       Injured   
## 2 3[183]  Indonesia offsho… 6.2   9.9          IV    -     -          -         
## 3 10[184] Democratic Repub… 5.0   10.0         V     Vari… 2          3         
## 4 10[187] China, Yunnan, 1… 5.0   10.0         II    Two … -          2         
## 5 16[189] Indonesia, Maluk… 5.9   5.7          VI    Home… -          -         
## 6 20[193] New Zealand, Ker… 6.5   10.0         IV    It w… -          -         
## 7 23[194] Peru, Cañete off… 5.9   49.5         VII   Duri… 1          20        
## 8 23[199] Argentina, Mendo… 4.6   10.0         IV    Some… -          -         
## 9 26[201] Turkey, Bingöl, … 5.4   10.0         VI    Seve… -          1         
## 
## [[18]]
## # A tibble: 9 × 2
##   X1                  X2                        
##   <chr>               <chr>                     
## 1 Strongest magnitude 8.2 Mw  United States     
## 2 Deadliest           5.7 Mw  Tajikistan5 deaths
## 3 Total fatalities    6                         
## 4 Number by magnitude Number by magnitude       
## 5 8.0−8.9             1                         
## 6 7.0−7.9             0                         
## 7 6.0−6.9             12                        
## 8 5.0−5.9             135                       
## 9 4.0−4.9             1123                      
## 
## [[19]]
## # A tibble: 16 × 8
##    Date    `Country and lo… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>   <chr>            <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date    Country and loc… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 2[203]  Fiji region      6.1   599.6        II    -     -          -         
##  3 4[204]  Chile, Atacama … 6.0   24.0         VI    Thes… -          -         
##  4 4[205]  Chile, Atacama … 6.0   22.0         VI    Thes… -          -         
##  5 8[206]  United States, … 6.0   7.5          VII   The … -          -         
##  6 10[208] Indonesia offsh… 6.1   43.6         IV    -     -          -         
##  7 10[209] Tajikistan, Dis… 5.7   12.8         VII   Five… 5          30        
##  8 18[212] Panama, Chiriqu… 6.1   9.1          IV    It w… -          -         
##  9 21[213] Papua New Guine… 6.0   8.7          IV    -     -          -         
## 10 21[214] Panama, Chiriqu… 6.7   10.0         VI    Powe… -          -         
## 11 23[216] Philippines, Ca… 6.7   110.0        V     Some… -          -         
## 12 24[219] New Zealand, Ke… 6.1   10.0         III   It w… -          -         
## 13 26[220] Indonesia, Cent… 6.3   11.0         VI    Powe… 1          -         
## 14 29[224] United States, … 8.2   35.0         VII   The … -          -         
## 15 29[227] Myanmar, Sagain… 5.5   10.0         VII   A br… -          -         
## 16 30[230] Peru, Piura, 9 … 6.2   34.4         VII   Mode… -          721       
## 
## [[20]]
## # A tibble: 9 × 2
##   X1                  X2                              
##   <chr>               <chr>                           
## 1 Strongest magnitude "8.1 Mw  South Sandwich Islands"
## 2 Deadliest           "7.2 Mw  Haiti2,248\ndeaths"    
## 3 Total fatalities    "2,250"                         
## 4 Number by magnitude "Number by magnitude"           
## 5 8.0−8.9             "1"                             
## 6 7.0−7.9             "4"                             
## 7 6.0−6.9             "15"                            
## 8 5.0−5.9             "296"                           
## 9 4.0−4.9             "955"                           
## 
## [[21]]
## # A tibble: 21 × 8
##    Date    `Country and lo… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>   <chr>            <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date    Country and loc… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 3[233]  India offshore,… 6.1   10.0         IV    -     -          -         
##  3 11[234] Philippines, Da… 7.1   65.6         VII   The … 1          -         
##  4 12[236] South Georgia a… 7.5   47.2         VI    The … -          -         
##  5 12[237] South Georgia a… 8.1   55.7         VII   The … -          -         
##  6 12[238] South Georgia a… 6.7   35.0         -     Thes… -          -         
##  7 12[239] South Georgia a… 6.1   35.0         IV    Thes… -          -         
##  8 12[240] Spain, Andalusi… 4.6   10.0         VII   It w… -          -         
##  9 13[242] South Georgia a… 6.1   10.0         IV    It w… -          -         
## 10 14[243] United States, … 6.9   21.0         IV    It w… -          -         
## # … with 11 more rows
## 
## [[22]]
## # A tibble: 9 × 2
##   X1                  X2                        
##   <chr>               <chr>                     
## 1 Strongest magnitude "7.0 Mw  Mexico"          
## 2 Deadliest           "7.0 Mw  Mexico11\ndeaths"
## 3 Total fatalities    "14"                      
## 4 Number by magnitude "Number by magnitude"     
## 5 8.0−8.9             "0"                       
## 6 7.0−7.9             "1"                       
## 7 6.0−6.9             "5"                       
## 8 5.0−5.9             "22"                      
## 9 4.0−4.9             "69"                      
## 
## [[23]]
## # A tibble: 10 × 8
##    Date    `Country and lo… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>   <chr>            <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 Date    Country and loc… Mw    Depth (km)   MMI   Notes Dead       Injured   
##  2 7[260]  Tonga, Haʻapai … 6.0   10.0         IV    -     -          -         
##  3 8[261]  Mexico, Guerrer… 7.0   20.0         VIII  The … 11         23        
##  4 13[274] Iran, Razavi Kh… 5.1   10.0         V     Buil… -          10        
##  5 13[276] Argentina, Salt… 6.2   193.4        IV    -     -          -         
##  6 15[277] China, Sichuan,… 5.4   10.0         VII   Thre… 3          146       
##  7 20[281] Russia, Kuril I… 6.0   25.0         IV    -     -          -         
##  8 21[282] Chile, Biobío o… 6.4   12.5         V     -     -          -         
##  9 21[283] Australia, Vict… 5.9   10.0         VII   The … -          -         
## 10 22[285] Nicaragua, offs… 6.5   30.7         V     -     -          -         
## 
## [[24]]
## # A tibble: 10 × 2
##    `.mw-parser-output .navbar{display:in… `.mw-parser-output .navbar{display:in…
##    <chr>                                  <chr>                                 
##  1 "January"                              "West Sulawesi, Indonesia (6.2, Jan 1…
##  2 "February"                             "Mindanao, Philippines (6.0, Feb 7)\n…
##  3 "March"                                "Antioquia, Colombia (5.1, Mar 1)\nLa…
##  4 "April"                                "East Java, Indonesia (6.0, Apr 10)\n…
##  5 "May"                                  "Yunnan, China (6.1, May 21)\nQinghai…
##  6 "June"                                 "Mala, Peru (5.8, June 23)"           
##  7 "July"                                 "Antelope Valley, California (6.0, Ju…
##  8 "August"                               "Davao Oriental, Philippines (7.1, Au…
##  9 "September"                            "Guerrero, Mexico (7.0, Sep 7)\nMelbo…
## 10 "† indicates earthquake resulting in … "† indicates earthquake resulting in …
## 
## [[25]]
## # A tibble: 4 × 2
##   `vteEarthquakes by year`                       `vteEarthquakes by year`       
##   <chr>                                          <chr>                          
## 1 "19th century"                                 "1900"                         
## 2 "20th century"                                 "1901\n1902\n1903\n1904\n1905\…
## 3 "21st century"                                 "2001\n2002\n2003\n2004\n2005\…
## 4 "Historical earthquakes\nLists of earthquakes" "Historical earthquakes\nLists…

That was a lot of tables! Let’s dig in and see if we can extract the monthly ones. I think the seventh one is January’s.

tables <- html %>%
  html_table(header = TRUE)

tables[[7]][-1,]
## # A tibble: 19 × 8
##    Date   `Country and loc… Mw    `Depth (km)` MMI   Notes Casualties Casualties
##    <chr>  <chr>             <chr> <chr>        <chr> <chr> <chr>      <chr>     
##  1 3[1]   United States, A… 6.1   21.0         VI    -     -          -         
##  2 6[2]   New Zealand offs… 6.2   37.1         IV    -     -          -         
##  3 6[3]   Croatia, Sisak-M… 4.7   10.0         VIII  It w… -          -         
##  4 6[5]   Indonesia, Goron… 6.1   148.0        IV    -     -          -         
##  5 8[6]   New Zealand offs… 6.3   224.0        IV    -     -          -         
##  6 8[7]   Vanuatu, Tafea o… 6.1   113.0        IV    -     -          -         
##  7 10[8]  Argentina, Jujuy… 6.1   217.0        IV    -     -          -         
##  8 10[9]  Vanuatu, Malampa… 6.1   160.0        IV    -     -          -         
##  9 10[10] Turkey, Ankara, … 4.3   10.0         IV    Seve… -          -         
## 10 11[12] Mongolia, Khövsg… 6.7   10.0         VIII  The … -          53        
## 11 14[15] Indonesia, West … 5.7   18.0         VII   It w… -          1         
## 12 14[17] Indonesia, West … 6.2   18.0         VIII  The … 108        3,369     
## 13 15[20] Iran, Hormozgan,… 5.5   8.0          VII   Abou… -          1         
## 14 19[22] Argentina, San J… 6.4   16.9         VII   Vari… -          14        
## 15 21[25] Philippines offs… 7.0   80.0         VI    In I… -          -         
## 16 23[27] Spain, Andalusia… 4.2   10.0         IV    It w… -          1         
## 17 23[29] Antarctica offsh… 6.9   9.8          V     Smal… -          -         
## 18 28[32] Spain, Andalusia… 4.3   10.0         IV    It w… -          -         
## 19 31[34] Guyana, Upper Ta… 5.5   5.4          VIII  It w… -          -
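
Rather than counting through the printout by hand, the monthly tables can also be picked out by their column names. In the output above, only the monthly tables have both a Date and an Mw column. Here is a minimal sketch of that idea; `toy_tables` is a made-up stand-in for the scraped list, and the "has Date and Mw" rule is my assumption about what marks a monthly table.

```r
# Toy stand-ins for the scraped list; only the column names matter here.
toy_tables <- list(
  data.frame(X1 = "Strongest magnitude", X2 = "7.0 Mw"),
  data.frame(Date = "3[1]", Mw = "6.1", MMI = "VI"),
  data.frame(X1 = "Strongest magnitude", X2 = "7.7 Mw"),
  data.frame(Date = "3[38]", Mw = "6.7", MMI = "IV")
)

# Treat a table as "monthly" if it has both a Date and an Mw column.
is_monthly <- vapply(
  toy_tables,
  function(tbl) all(c("Date", "Mw") %in% names(tbl)),
  logical(1)
)

# Positions of the monthly tables within the list.
monthly_idx <- which(is_monthly)
monthly_idx
```

Running the same check on the real `tables` list should give the positions of the monthly tables without any guessing.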

Let’s now start building the final data frame, adding a column that records which month each earthquake happened in.

jan <- tables[[7]][-1,]

df <- jan %>% add_column(Month = 'January')
## Warning: The `.data` argument of `add_column()` must have unique names as of tibble 3.0.0.
## Use `.name_repair = "minimal"`.
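
The warning comes from the two columns that are both called Casualties. The first row of each monthly table actually holds the real sub-headings (Dead and Injured), so one way around the warning is to use those as column names before dropping the row. This is only a sketch on a made-up tibble called `toy`, not what the rest of this post does.

```r
library(tibble)

# Toy table with the same problem: two columns both called "Casualties",
# and a first row that repeats the header with the real sub-headings.
toy <- tibble(
  Date = c("Date", "3[1]"),
  Casualties = c("Dead", "-"),
  Casualties = c("Injured", "-"),
  .name_repair = "minimal"
)

# Take the sub-headings from row 1 as names for the duplicated columns,
# then drop that row.
dup <- names(toy) == "Casualties"
first_row <- unlist(toy[1, ], use.names = FALSE)
names(toy)[dup] <- first_row[dup]
toy <- toy[-1, ]

# Column names are unique now, so add_column() has nothing to warn about.
toy <- add_column(toy, Month = "January")
names(toy)
```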

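Let’s automate the rest. The monthly tables sit at every other position starting from the seventh, so month number n is table 5 + 2n. The page currently runs through September, so there are nine months of data.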

monthlist <- c('January','February','March','April','May','June','July','August','September')
counter <- 2:9

for (number in counter){
  newtable <- tables[[5+2*number]][-1,] # drop the first row, which just repeats the column headers
  newtable <- newtable %>% add_column(Month = monthlist[number])
  df <- bind_rows(df,newtable)
}
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
## New names:
## * Casualties -> Casualties...7
## * Casualties -> Casualties...8
head(df)
## # A tibble: 6 × 9
##   Date  `Country and locatio… Mw    `Depth (km)` MMI   Notes      Casualties...7
##   <chr> <chr>                 <chr> <chr>        <chr> <chr>      <chr>         
## 1 3[1]  United States, Alask… 6.1   21.0         VI    -          -             
## 2 6[2]  New Zealand offshore… 6.2   37.1         IV    -          -             
## 3 6[3]  Croatia, Sisak-Mosla… 4.7   10.0         VIII  It was an… -             
## 4 6[5]  Indonesia, Gorontalo… 6.1   148.0        IV    -          -             
## 5 8[6]  New Zealand offshore… 6.3   224.0        IV    -          -             
## 6 8[7]  Vanuatu, Tafea offsh… 6.1   113.0        IV    -          -             
## # … with 2 more variables: Casualties...8 <chr>, Month <chr>
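
The same loop can also be written without growing `df` one step at a time, by building a list of monthly tables and binding them once at the end. A minimal sketch, assuming the 5 + 2 * month layout used above; `toy_tables` and `month_names` are made-up stand-ins so the example runs on its own.

```r
library(dplyr)

month_names <- c("January", "February", "March")

# Toy list laid out like the scraped one: monthly tables sit at positions
# 7, 9, 11 (5 + 2 * month number); every other slot is left empty.
toy_tables <- vector("list", 11)
for (m in seq_along(month_names)) {
  toy_tables[[5 + 2 * m]] <- tibble(
    Date = c("Date", paste0(m, "[", m, "]")),
    Mw   = c("Mw", "6.0")
  )
}

# One cleaned table per month, each tagged with its month name.
monthly <- lapply(seq_along(month_names), function(m) {
  tbl <- toy_tables[[5 + 2 * m]][-1, ]  # drop the repeated header row
  mutate(tbl, Month = month_names[m])
})

# Stack everything in a single call.
toy_df <- bind_rows(monthly)
toy_df
```

Binding once at the end also means January no longer needs special treatment before the loop.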

I am going to need to do some cleaning now that I have gathered the data. In the Date column, each day is followed by a reference marker in square brackets. Let’s get rid of that and store the result in a new column, Day.

df <- df %>%
  mutate(
    # remove bracketed reference markers such as "[12]"
    Day = str_remove_all(Date,pattern = "\\[[0-9]+\\]")
  )
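
To see what the cleaning step does, here it is on a few values, using the pattern `\\[[0-9]+\\]` (a literal `[`, one or more digits, a literal `]`). The first three strings appear in the tables above; the last one, with two markers, is a hypothetical case I added to show that `str_remove_all()` strips every match.

```r
library(stringr)

# Day numbers with trailing reference markers.
dates <- c("3[1]", "10[10]", "29[224]", "6[3][4]")

# Remove every bracketed run of digits, leaving just the day.
days <- str_remove_all(dates, "\\[[0-9]+\\]")
days
```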

Next I’d like to change all the casualty entries from NA to zero.

df <- df %>%
  mutate(
    Deaths = str_replace_all(Casualties...7,"\\-","0"),
    Injuries = str_replace_all(Casualties...8,"\\-","0")
  )

They were not actually NA but the string "-". Tricky!
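
One thing to be careful about: replacing every hyphen would also touch a value that merely contains one. A slightly safer variant, sketched here on a few sample values, anchors the pattern so only a cell that is exactly "-" becomes "0". This is an alternative I am suggesting, not the code used above.

```r
library(stringr)

# Sample casualty cells: a placeholder dash and some real counts.
casualties <- c("-", "108", "3,369", "1")

# "^-$" matches only when the whole string is a single hyphen.
cleaned <- str_replace_all(casualties, "^-$", "0")
cleaned
```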

Let’s continue by extracting some information about country and location.

df <- df %>%
  mutate(
    Offshore = str_detect(`Country and location`,"offshore"),
    # [A-Za-z] matches letters only; [A-z] would also match [, \, ], ^, _ and `
    Country = str_extract(`Country and location`,'[A-Za-z]+')
  )

A couple of these are wrong because the pattern only captures the first word, so multi-word names get cut short. I’ll fix the obvious ones and come back if I see others I missed later.

df <- df %>%
  mutate(
    Country = replace(Country, Country == 'United','United States'),
    Country = replace(Country, Country == 'South', 'South Georgia'),
    Country = replace(Country, Country == 'Papua','Papua New Guinea')
  )
df <- df %>%
  mutate(
    Country = replace(Country, str_detect(`Country and location`,'New Zealand'),'New Zealand'),
    Country = replace(Country, str_detect(`Country and location`,'New Cal'),'New Caledonia')
  )
df <- df %>%
  rename(Depth = 'Depth (km)')
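
An alternative that might avoid most of the manual fixes is to keep everything before the first comma and then strip a trailing "offshore" or "region". This is just a sketch: the strings below are made up in the same shape as the Country and location column (the real values are truncated in the printout), and some real entries are not countries at all, so a few would still need attention.

```r
library(stringr)

# Made-up location strings in the "Country, region" / "Country offshore, ..." shape.
locations <- c(
  "United States, Region A",
  "New Zealand offshore, Region B",
  "Papua New Guinea, Region C",
  "Fiji region"
)

# Keep the text before the first comma...
before_comma <- str_extract(locations, "^[^,]+")

# ...then drop a trailing "offshore" or "region" and any leftover whitespace.
country <- str_trim(str_remove(before_comma, "\\s+(offshore|region)$"))
country
```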

Okay, I think I have extracted all the info I am going to. I’ll clean up the dataset and organize it.

df <- df %>%
  select(c(Month,Day,Country,Deaths,Injuries,Mw,Depth,MMI,Offshore))
df <- df %>%
  mutate(
    Deaths = str_remove_all(Deaths,","),
    Injuries = str_remove_all(Injuries, ",")
  )
df <- df %>%
  mutate(Deaths = as.integer(Deaths),
         Injuries = as.integer(Injuries),
         Depth = as.numeric(Depth),
         Day = as.integer(Day),
         Mw = as.numeric(Mw))
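
Since `as.integer()` and `as.numeric()` turn anything they cannot parse into NA, it can be worth counting the NAs per column before dropping any rows. A minimal sketch on a made-up data frame (`toy`, with one deliberately bad Day value):

```r
# Toy data frame standing in for df after the type conversions.
toy <- data.frame(
  Day    = suppressWarnings(as.integer(c("3", "6", "x"))),  # "x" is a deliberately bad value
  Deaths = as.integer(c("0", "108", "1")),
  Mw     = as.numeric(c("6.1", "6.2", "4.7"))
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))
na_counts
```

Any column with a non-zero count is one where rows would later disappear under `drop_na()`.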

Ready For EDA

df <- df %>%
  drop_na()
df %>%
  summarize(Average_Deaths = mean(Deaths),
            Average_Injuries = mean(Injuries))
## # A tibble: 1 × 2
##   Average_Deaths Average_Injuries
##            <dbl>            <dbl>
## 1           15.4             114.
df %>% summarize_if(is.numeric, c(Mean = mean,Median = median))
## # A tibble: 1 × 10
##   Day_Mean Deaths_Mean Injuries_Mean Mw_Mean Depth_Mean Day_Median Deaths_Median
##      <dbl>       <dbl>         <dbl>   <dbl>      <dbl>      <dbl>         <dbl>
## 1     14.0        15.4          114.    6.08       41.9         13             0
## # … with 3 more variables: Injuries_Median <dbl>, Mw_Median <dbl>,
## #   Depth_Median <dbl>
library(corrr)
df_cor <- df %>%
  select(c(Day,Injuries,Deaths,Mw,Depth)) %>%
  correlate()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
df_cor
## # A tibble: 5 × 6
##   term           Day Injuries    Deaths      Mw   Depth
##   <chr>        <dbl>    <dbl>     <dbl>   <dbl>   <dbl>
## 1 Day      NA         0.00863 -0.000666 -0.0727 -0.150 
## 2 Injuries  0.00863  NA        0.976     0.127  -0.0357
## 3 Deaths   -0.000666  0.976   NA         0.128  -0.0309
## 4 Mw       -0.0727    0.127    0.128    NA       0.0882
## 5 Depth    -0.150    -0.0357  -0.0309    0.0882 NA

Here we see that Injuries and Deaths are very strongly correlated (r ≈ 0.98), while every other pair is only weakly correlated (|r| ≤ 0.15).

stretch(df_cor) %>%
  ggplot(aes(x=x, y=y, fill=r, label = round(r,2))) +
  geom_tile() +
  geom_text(na.rm = TRUE) # draw the rounded r values mapped to `label`
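
A Pearson correlation as high as the one between Injuries and Deaths can be driven by a single catastrophic event, because Pearson's r is sensitive to outliers. The sketch below uses made-up numbers (not the scraped data) to compare it with the rank-based Spearman coefficient.

```r
# Made-up event counts: nine small events and one catastrophic one
injuries <- c(0, 0, 1, 2, 0, 3, 1, 0, 5, 12000)
deaths   <- c(0, 1, 0, 0, 2, 0, 1, 0, 0, 2200)

pearson  <- cor(injuries, deaths, method = "pearson")
spearman <- cor(injuries, deaths, method = "spearman")

# Pearson is pulled close to 1 by the single extreme row; Spearman is not
round(c(pearson = pearson, spearman = spearman), 3)
```

On the real data, passing `method = "spearman"` to `correlate()` should give the rank-based version of the table above.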

df %>%
  group_by(MMI) %>%
  summarize_if(is.numeric,mean)
## # A tibble: 10 × 6
##    MMI     Day   Deaths Injuries    Mw Depth
##    <chr> <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1 -     10.9    0         0      6.15  14.3
##  2 I     11      0         0      6.65  10  
##  3 II     8.25   0         0.5    5.8  304. 
##  4 III   10.8    0.0833    0      6.1   66.0
##  5 IV    14.4    0.0189    0.170  6.09  51.6
##  6 IX    15.5  563.     3198.     6.15  10  
##  7 V     14.4    0.8      11.5    5.87  24.6
##  8 VI    13.7    0.0625    0.25   6.12  22.1
##  9 VII   16.6    0.552    38.8    6.05  24.4
## 10 VIII  13.5    9.54    283.     6.21  14.3
summary(df$Day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   13.00   14.03   21.00   31.00
ggplot(data = df, aes(y= Day,color = Month)) +
  geom_boxplot()

ggplot(df, aes(x= Day)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df, aes(x= Injuries))+
  geom_histogram(bins = 100)

ggplot(df, aes(x= MMI))+
  geom_bar(aes(fill = Month), position = 'fill')

ggplot(df, aes(x= Injuries, y= Deaths)) +
  geom_jitter(aes(color = MMI))

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
df1 <- df %>% 
  select(-Country,-Month,-Day) 


ggpairs(df1, aes(color = Offshore))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggmosaic)
## 
## Attaching package: 'ggmosaic'
## The following object is masked from 'package:GGally':
## 
##     happy
ggplot(data = df)+
  geom_mosaic( aes(x = product(MMI,Month),fill = Offshore),na.rm = TRUE) 

ggplot(df,aes(sample = Injuries)) +
  geom_qq() +
  geom_qq_line()

df %>%
  count(Offshore, MMI) %>%
  spread(MMI,n, fill = 0)
## # A tibble: 2 × 11
##   Offshore   `-`     I    II   III    IV    IX     V    VI   VII  VIII
##   <lgl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE        0     1     2     2    10     4     7     5    18    11
## 2 TRUE         8     1     2    10    43     0     8    11    11     2
df %>%
  group_by(Offshore) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency/sum(Frequency))
## # A tibble: 2 × 3
##   Offshore Frequency Proportion
##   <lgl>        <int>      <dbl>
## 1 FALSE           60      0.385
## 2 TRUE            96      0.615

Decision Trees

Moving on to decision trees and classification. For a classification tree, the variable you are predicting must be a factor!

df <- df %>% mutate(
  Offshore = factor(Offshore == TRUE, levels = c(TRUE, FALSE),
                    labels = c('offshore','on land' ))
)
library(rpart)
library(rpart.plot)

tree <- rpart(Offshore ~.,data = df)
tree
## n= 156 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 156 60 offshore (0.61538462 0.38461538)  
##   2) Country=Algeria,Antarctica,Australia,Chile,El,Fiji,France,India,Indonesia,Japan,New Caledonia,New Zealand,Nicaragua,Panama,Papua New Guinea,Philippines,Russia,South Georgia,Tonga,United States,Vanuatu 110 15 offshore (0.86363636 0.13636364)  
##     4) Mw>=5.95 99  6 offshore (0.93939394 0.06060606) *
##     5) Mw< 5.95 11  2 on land (0.18181818 0.81818182) *
##   3) Country=Argentina,Armenia,China,Colombia,Croatia,Democratic,Greece,Guyana,Haiti,Iceland,Iran,Iraq,Mauritius,Mexico,Mongolia,Myanmar,Nepal,Peru,Rwanda,Southern,Spain,Taiwan,Tajikistan,Tanzania,Turkey,West 46  1 on land (0.02173913 0.97826087) *
rpart.plot(tree, extra = 2)
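
The factor requirement matters because `rpart()` picks its method from the type of the response. A minimal sketch on the built-in `mtcars` data (an illustration only, not the earthquake data), where `am` is coded 0/1:

```r
library(rpart)

# Same response, once as a number and once as a factor
fit_numeric <- rpart(am ~ mpg + wt, data = mtcars)
fit_factor  <- rpart(factor(am) ~ mpg + wt, data = mtcars)

fit_numeric$method  # "anova": a regression tree
fit_factor$method   # "class": a classification tree
```

Passing `method = "class"` explicitly is another way to ask for a classification tree.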

To make a prediction with the tree, we pass predict() the fitted tree and the dataset we want predictions for.

pred <- predict(tree, df, type = "class")
head(pred)
##        1        2        3        4        5        6 
## offshore offshore  on land offshore offshore offshore 
## Levels: offshore on land

Each earthquake has been classified into a category. You can also recover the class probabilities by dropping type = "class".

predict(tree, df) %>%
  head()
##     offshore    on land
## 1 0.93939394 0.06060606
## 2 0.93939394 0.06060606
## 3 0.02173913 0.97826087
## 4 0.93939394 0.06060606
## 5 0.93939394 0.06060606
## 6 0.93939394 0.06060606
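
The two kinds of output are linked: the predicted class is the column with the largest probability. A small sketch on the built-in `iris` data (used here only because it ships with R) checks this.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)

prob <- predict(fit, iris)                  # matrix of class probabilities
cls  <- predict(fit, iris, type = "class")  # factor of predicted labels

# Pick the column with the largest probability in each row
cls_from_prob <- colnames(prob)[max.col(prob, ties.method = "first")]

all(cls_from_prob == as.character(cls))  # TRUE for this tree
```

Every row that lands in the same leaf gets the same probabilities, which is why the output above repeats only a few distinct values.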

We see that the first earthquake has about a 94% chance of being offshore. That number is simply the share of offshore earthquakes in the leaf it falls into.

Next, a confusion table compares the true labels with the predicted classes.

confusion_table <- with(df, table(Offshore, pred))
confusion_table
##           pred
## Offshore   offshore on land
##   offshore       93       3
##   on land         6      54
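
The usual summary numbers can be read straight off this table. The sketch below re-enters the four counts by hand and treats "offshore" as the positive class (an arbitrary choice for illustration).

```r
# Counts copied from the confusion table above (rows = truth, columns = prediction)
confusion <- matrix(c(93,  3,
                       6, 54),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(truth = c("offshore", "on land"),
                                    pred  = c("offshore", "on land")))

accuracy  <- sum(diag(confusion)) / sum(confusion)
recall    <- confusion["offshore", "offshore"] / sum(confusion["offshore", ])
precision <- confusion["offshore", "offshore"] / sum(confusion[, "offshore"])

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
# accuracy 0.942, recall 0.969, precision 0.939
```

Keep in mind these are training-set numbers, since the tree is scored on the same rows it was fitted to.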

I will now examine what happens if I withhold some of the data. Strictly speaking this is a single hold-out split rather than cross-validation: about two thirds of the rows go to training and the remaining third to testing.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
inTrain <- createDataPartition(y = df$Month, p = .66, list = FALSE)
df_train <- df %>% slice(inTrain)
df_test <- df %>% slice(-inTrain)
dim(df_train)
## [1] 107   9
dim(df_test)
## [1] 49  9
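
The split above is a single hold-out. For comparison, k-fold cross-validation rotates the test set so every row is held out exactly once. A minimal base R sketch on the built-in `iris` data, assuming 5 folds and an arbitrary seed (neither comes from the original analysis):

```r
library(rpart)

set.seed(42)  # arbitrary seed so the folds are reproducible
k <- 5
# Assign each row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train)
  mean(predict(fit, test, type = "class") == test$Species)
})

fold_accuracy        # one accuracy per held-out fold
mean(fold_accuracy)  # cross-validated estimate
```

Averaging over folds gives a steadier estimate than one split, which helps with a dataset as small as this one.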

I will use the training set to build my model and then test it. If you look at my original tree, Country is very important. I am going to drop it so it cannot be part of the decision tree (it was causing hell when I tried to build and test a tree, most likely because some countries in the test set never show up in the training set!).

tree_from_train <- rpart(Offshore ~.,data = subset(df_train, select=c( -Country)))
pred_test <- predict(tree_from_train, subset(df_test, select=c( -Country)), type = "class")
with(df_test, table(Offshore, pred_test))
##           pred_test
## Offshore   offshore on land
##   offshore       22       3
##   on land         3      21

Pretty decent job predicting on the withheld data: 43 of 49 (about 88%) are classified correctly!

I’ll create a full tree next! I only have ~150 data points, so I don’t really need to cut the data down, but if you have lots of data please do! I left in the code that does it: sample_n() draws n random rows from the data.

df_no_Country <- subset(df, select=c( -Country))
tree_full <- sample_n(df_no_Country, 100) %>% # keep a random sample of 100 rows
  rpart(Offshore ~., data = ., control = rpart.control(minsplit = 2, cp = 0))

rpart.plot(tree_full, extra = 2, roundint=FALSE,
  box.palette = list( "Gn", "Bu")) # specify 2 colors

Holy cow that looks difficult to interpret!

I see now that I was supposed to withhold some data to test with. Because the sample was drawn inline, I can’t tell which rows were left out, but I can still predict on all the data. Note that the 100 sampled rows are perfectly classified, so the remaining 56 are the only ones that could be misclassified.

pred_full <- predict(tree_full, df_no_Country, type = "class")
with(df, table(Offshore, pred_full))
##           pred_full
## Offshore   offshore on land
##   offshore       92       4
##   on land         5      55

Still not terrible, but 9 earthquakes are misclassified here, whereas there are no misclassifications on the training data. That gap is over-fitting: the full tree has low bias but high variance, so it fits the training sample too closely.
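
One way to avoid losing track of the held-out rows is to sample the row indices first and keep them. A sketch on the built-in `iris` data, with an arbitrary seed and the same `minsplit = 2, cp = 0` settings as the full tree above:

```r
library(rpart)

set.seed(1)  # arbitrary seed
train_idx <- sample(nrow(iris), 100)  # keep the indices instead of sampling inline
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fully grown tree: split until nodes are pure or cannot be split further
full_tree <- rpart(Species ~ ., data = train,
                   control = rpart.control(minsplit = 2, cp = 0))

# Share of rows whose predicted class matches the true class
acc <- function(fit, d) mean(predict(fit, d, type = "class") == d$Species)

c(train = acc(full_tree, train), test = acc(full_tree, test))
```

A training accuracy noticeably higher than the test accuracy is the over-fitting signature described above.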

imp <- varImp(tree)
head(imp)
##            Overall
## Country  54.165273
## Deaths    6.510731
## Injuries 23.322836
## MMI      19.314721
## Month     2.073474
## Mw       48.863238
imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
  geom_bar()

I am not satisfied with this approach, as I have no idea what varImp() does under the hood, so I’ll repeat this using a chi-squared test for significance.

library(FSelector)

weights <- df %>% chi.squared(Offshore ~ ., data = .) %>%
  as_tibble(rownames = "feature") %>%
  arrange(desc(attr_importance))
weights
## # A tibble: 8 × 2
##   feature  attr_importance
##   <chr>              <dbl>
## 1 Country            0.865
## 2 Mw                 0.713
## 3 MMI                0.519
## 4 Injuries           0.517
## 5 Month              0.298
## 6 Day                0    
## 7 Deaths             0    
## 8 Depth              0
ggplot(weights,
  aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")
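
As far as I can tell from the FSelector documentation, the score returned by `chi.squared()` is Cramér's V, which can be checked by hand for a categorical feature. The sketch below re-enters the Offshore by MMI counts from the `count()`/`spread()` table earlier in this post and uses base R's `chisq.test()`.

```r
# Offshore x MMI counts copied from the table earlier in the post
counts <- rbind(
  "FALSE" = c(0, 1, 2,  2, 10, 4, 7,  5, 18, 11),
  "TRUE"  = c(8, 1, 2, 10, 43, 0, 8, 11, 11,  2)
)
colnames(counts) <- c("-", "I", "II", "III", "IV", "IX", "V", "VI", "VII", "VIII")

# chisq.test() warns about small expected counts; only the statistic is needed here
chi2 <- unname(suppressWarnings(chisq.test(counts))$statistic)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n <- sum(counts)
cramers_v <- sqrt(chi2 / (n * (min(dim(counts)) - 1)))
round(cramers_v, 3)
# 0.519
```

That matches the `attr_importance` reported for MMI above. The numeric features are discretized before scoring, so they cannot be reproduced from a printed table in the same way.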

Another tree because I was playing around…

tree1 <- rpart(MMI ~Offshore + Deaths + Mw,data = df, method = 'class')
tree1
## n= 156 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 156 103 IV (0.051 0.013 0.026 0.077 0.34 0.026 0.096 0.1 0.19 0.083)  
##    2) Offshore=offshore 96  53 IV (0.083 0.01 0.021 0.1 0.45 0 0.083 0.11 0.11 0.021)  
##      4) Mw< 6.8 79  40 IV (0.1 0.013 0.025 0.13 0.49 0 0.089 0.1 0.051 0) *
##      5) Mw>=6.8 17  10 VII (0 0 0 0 0.24 0 0.059 0.18 0.41 0.12) *
##    3) Offshore=on land 60  42 VII (0 0.017 0.033 0.033 0.17 0.067 0.12 0.083 0.3 0.18)  
##      6) Mw< 5.35 21  16 IV (0 0 0.048 0.048 0.24 0.095 0.24 0.095 0.19 0.048)  
##       12) Mw>=4.75 13   8 V (0 0 0.077 0.077 0.077 0.15 0.38 0.077 0.15 0) *
##       13) Mw< 4.75 8   4 IV (0 0 0 0 0.5 0 0 0.12 0.25 0.12) *
##      7) Mw>=5.35 39  25 VII (0 0.026 0.026 0.026 0.13 0.051 0.051 0.077 0.36 0.26)  
##       14) Mw>=6.55 7   5 IX (0 0.14 0 0.14 0.14 0.29 0 0 0 0.29) *
##       15) Mw< 6.55 32  18 VII (0 0 0.031 0 0.13 0 0.063 0.094 0.44 0.25) *
rpart.plot(tree1, extra = 2)

If you want to implement C4.5 or C5.0, check out the examples here