Reproducible Research Course Main Project

This document is an exploratory review of the NOAA Storm Database and addresses the following key questions:

  1. Across the United States, which types of weather events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

The dataset used for analysis is from the NOAA strom tracking database

In addition there is a FAQ.

Load the data and parse it, this will download the compressed file and parse into R, if the file has already been downloaded locally it will skip that step.

Lets go ahead and dump out a summary of EVTYPE since this will best answer our question to categorize by event type.

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
localZipFile <- "StormData.bz2"
if (!file.exists(localZipFile)) {
    download.file(URL, destfile=localZipFile)
}

# this bz2 jas 500 MB unzipped on disk, takes about 80s on my PC
system.time(stormdata <- read.csv(bzfile(localZipFile), header=TRUE))
##    user  system elapsed 
##   78.58    0.30   79.05
# we know the date is easy to reformat
stormdata$date <- mdy_hms(stormdata$BGN_DATE)

summary(stormdata$EVTYPE)
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD 
##                     3392                     2761                     2719 
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT 
##                     2488                     2006                     1678 
##               HIGH WINDS         WILD/FOREST FIRE             FROST/FREEZE 
##                     1533                     1457                     1342 
##                DENSE FOG       WINTER WEATHER/MIX           TSTM WIND/HAIL 
##                     1293                     1104                     1028 
##  EXTREME COLD/WIND CHILL                     HEAT                HIGH SURF 
##                     1002                      767                      725 
##           TROPICAL STORM           FLASH FLOODING             EXTREME COLD 
##                      690                      682                      655 
##            COASTAL FLOOD         LAKE-EFFECT SNOW        FLOOD/FLASH FLOOD 
##                      650                      636                      624 
##                LANDSLIDE                     SNOW          COLD/WIND CHILL 
##                      600                      587                      539 
##                      FOG              RIP CURRENT              MARINE HAIL 
##                      538                      470                      442 
##               DUST STORM                AVALANCHE                     WIND 
##                      427                      386                      340 
##             RIP CURRENTS              STORM SURGE            FREEZING RAIN 
##                      304                      261                      250 
##              URBAN FLOOD     HEAVY SURF/HIGH SURF        EXTREME WINDCHILL 
##                      249                      228                      204 
##             STRONG WINDS           DRY MICROBURST    ASTRONOMICAL LOW TIDE 
##                      196                      186                      174 
##                HURRICANE              RIVER FLOOD               LIGHT SNOW 
##                      174                      173                      154 
##         STORM SURGE/TIDE            RECORD WARMTH         COASTAL FLOODING 
##                      148                      146                      143 
##               DUST DEVIL         MARINE HIGH WIND        UNSEASONABLY WARM 
##                      141                      135                      126 
##                 FLOODING   ASTRONOMICAL HIGH TIDE        MODERATE SNOWFALL 
##                      120                      103                      101 
##           URBAN FLOODING               WINTRY MIX        HURRICANE/TYPHOON 
##                       98                       90                       88 
##            FUNNEL CLOUDS               HEAVY SURF              RECORD HEAT 
##                       87                       84                       81 
##                   FREEZE                HEAT WAVE                     COLD 
##                       74                       74                       72 
##              RECORD COLD                      ICE  THUNDERSTORM WINDS HAIL 
##                       64                       61                       61 
##      TROPICAL DEPRESSION                    SLEET         UNSEASONABLY DRY 
##                       60                       59                       56 
##                    FROST              GUSTY WINDS      THUNDERSTORM WINDSS 
##                       53                       53                       51 
##       MARINE STRONG WIND                    OTHER               SMALL HAIL 
##                       48                       48                       47 
##                   FUNNEL             FREEZING FOG             THUNDERSTORM 
##                       46                       45                       45 
##       Temperature record          TSTM WIND (G45)         Coastal Flooding 
##                       43                       39                       38 
##              WATERSPOUTS    MONTHLY PRECIPITATION                    WINDS 
##                       37                       36                       36 
##                  (Other) 
##                     2940

Data Processing

We can see this will need to be normalized, since there are a great variety of levels for this attribute. The following code will go through and use grepl to match the patterns and create a new feature called catType to help normalize these event types.

#
# levelfactor(f, dict(code,pattern))
#   returns factor that is simplified

levelfactor <- function(f, myDict) {
    g <- as.character(f)
    levels(g) <- c(levels(g), levels(myDict$code))
    # unfortanely at this time I can think of no
    # better way than loop through the rules
    #     ... at least it's hidden in this function
    for(i in 1:nrow(myDict)) {
        g[grepl(myDict[i,2], g, ignore.case=TRUE)] <- myDict[i,1]
    }
    as.factor(as.character(g))
}


# there's got to be a better way to do this
dict <- data.frame(code="RAIN", pattern="RAIN", stringsAsFactors=FALSE)
dict[nrow(dict)+1,] <- list(code="WIND", pattern="WIND|WND")
dict[nrow(dict)+1,] <- list(code="FLOOD", pattern="FLO?O?O?D")
dict[nrow(dict)+1,] <- list(code="FLOOD", pattern="TSUMANI|RISING WATER")
dict[nrow(dict)+1,] <- list(code="SNOW", pattern="SNOW|WINTER|WINTRY")
dict[nrow(dict)+1,] <- list(code="RAIN", pattern="SHOWER|MICROURST|BURST|HEAVY MIX")
dict[nrow(dict)+1,] <- list(code="FIRE", pattern="FIRE|SMOKE")
dict[nrow(dict)+1,] <- list(code="HEAT", pattern="HEAT|HIGH|HOT|WARM|EXCESSIVE")
dict[nrow(dict)+1,] <- list(code="COLD", pattern="COLD|FREEZ|FROST|LOW|COOL")
dict[nrow(dict)+1,] <- list(code="FOG", pattern="FOG")
dict[nrow(dict)+1,] <- list(code="HAIL", pattern="HAIL")
dict[nrow(dict)+1,] <- list(code="WATERSPOUT", pattern="SPOUT|FUNNEL")
dict[nrow(dict)+1,] <- list(code="HURRICANE", pattern="HURRICANE|TYPHOON")
dict[nrow(dict)+1,] <- list(code="TORNADO", pattern="TORNADO|TORNDAO")
dict[nrow(dict)+1,] <- list(code="FOG", pattern="FOG")
dict[nrow(dict)+1,] <- list(code="THUNDERSTORM", pattern="THUNDER|TSTM")
dict[nrow(dict)+1,] <- list(code="LIGHTNING", pattern="LIG[HN]TN?ING")
dict[nrow(dict)+1,] <- list(code="TIDE", pattern="TIDE|SURF|WAVE|RIP CURRENT")
dict[nrow(dict)+1,] <- list(code="ICE", pattern="IC[EY]")
dict[nrow(dict)+1,] <- list(code="TEMPERATURE", pattern="TEMPERATURE|EXPOSURE")
dict[nrow(dict)+1,] <- list(code="DUST DEVIL", pattern="DUST DEVIL")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="OTHER|summary|NONE|[?]")
dict[nrow(dict)+1,] <- list(code="MARINE", pattern="MARINE|SEAS|DROWNING|SWELLS")
dict[nrow(dict)+1,] <- list(code="LANDSLIDE", pattern="SLIDE|LANDSLUMP")
dict[nrow(dict)+1,] <- list(code="PRECIPITATION", pattern="PRECIP")
dict[nrow(dict)+1,] <- list(code="SLEET", pattern="SLEET")
dict[nrow(dict)+1,] <- list(code="GLAZE", pattern="GLAZE")
dict[nrow(dict)+1,] <- list(code="DRY", pattern="DRY|DRIEST")
dict[nrow(dict)+1,] <- list(code="DUST", pattern="DUST")
dict[nrow(dict)+1,] <- list(code="TROPICAL STORM", pattern="TROPICAL STORM")
dict[nrow(dict)+1,] <- list(code="GUSTNADO", pattern="GUSTNADO")
dict[nrow(dict)+1,] <- list(code="VOLCANIC ASH", pattern="VOLCANIC")
dict[nrow(dict)+1,] <- list(code="WET", pattern="WET")
dict[nrow(dict)+1,] <- list(code="COASTAL STORM", pattern="COASTAL")
dict[nrow(dict)+1,] <- list(code="URBAN/SMALL STREAM", pattern="SMALL|STREAM")
dict[nrow(dict)+1,] <- list(code="BEACH EROSION", pattern="BEACH")
dict[nrow(dict)+1,] <- list(code="AVALANCHE", pattern="AVALANCH?E")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="Metro Storm|No Severe|NORTHERN LIGHTS|SOUTHEAST")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="APACHE COUNTY|MILD PATTERN|VOG")
dict[nrow(dict)+1,] <- list(code="DAM", pattern="DAM")
dict[nrow(dict)+1,] <- list(code="HURRICANE", pattern="REMNANTS OF FLOYD|RED FLAG")
dict[nrow(dict)+1,] <- list(code="CLOUD", pattern="CLOUD")
dict[nrow(dict)+1,] <- list(code="BLIZZARD", pattern="BLIZZARD")
dict[nrow(dict)+1,] <- list(code="WIND", pattern="TURBULENCE")

stormdata$catType <- levelfactor(stormdata$EVTYPE, dict)
summary(stormdata$catType)
##           AVALANCHE       BEACH EROSION            BLIZZARD 
##                 387                   5                2722 
##               CLOUD       COASTAL STORM                COLD 
##                  11                  14                2690 
##                 DAM             DROUGHT                 DRY 
##                   5                2488                  35 
##                DUST                FIRE               FLOOD 
##                 582                4261               86091 
##                 FOG               GLAZE            GUSTNADO 
##                1836                  43                   7 
##                HAIL                HEAT           HURRICANE 
##              289276                4109                 302 
##                 ICE           LANDSLIDE           LIGHTNING 
##                2131                 648               15763 
##              MARINE               OTHER       PRECIPITATION 
##                  85                 139                  91 
##                RAIN              SEICHE               SLEET 
##               12463                  21                  71 
##                SNOW         STORM SURGE         TEMPERATURE 
##               37327                 261                  75 
##        THUNDERSTORM                TIDE             TORNADO 
##                  95                1018               60685 
## TROPICAL DEPRESSION      TROPICAL STORM             TSUNAMI 
##                  60                 697                  20 
##  URBAN/SMALL STREAM        VOLCANIC ASH          WATERSPOUT 
##                  18                  29               10845 
##                 WET                WIND 
##                  11              364880

As you can see, we still have a lot of categories, but this is greatly reduced, let’s look at the top 10 by incident count

stormdata %>% 
    group_by(catType) %>% 
    summarize(count=n()) %>% 
    arrange(desc(count))%>% 
    slice(1:10) %>% 
    ungroup()
## # A tibble: 10 x 2
##       catType  count
##        <fctr>  <int>
##  1       WIND 364880
##  2       HAIL 289276
##  3      FLOOD  86091
##  4    TORNADO  60685
##  5       SNOW  37327
##  6  LIGHTNING  15763
##  7       RAIN  12463
##  8 WATERSPOUT  10845
##  9       FIRE   4261
## 10       HEAT   4109
# go ahead and re-order the factors based on incident count
factorOrder <- group_by(stormdata, catType) %>%
    summarize(count=n()) %>% 
    arrange(desc(count))

stormdata$catType <- factor(stormdata$catType, levels(stormdata$catType)[factorOrder$catType])

We can confirm the normalization with the following code, since there are still around 50 categories, we will only display the groups for the top 3 categories in this document

unique(droplevels(stormdata[stormdata$catType == "WIND",]$EVTYPE))
##   [1] TSTM WIND                      HURRICANE OPAL/HIGH WINDS     
##   [3] THUNDERSTORM WINDS             THUNDERSTORM WIND             
##   [5] HIGH WINDS                     THUNDERSTORM WINDS LIGHTNING  
##   [7] THUNDERSTORM WINDS/HAIL        WIND                          
##   [9] THUNDERSTORM WINDS HAIL        HIGH WIND                     
##  [11] WIND CHILL                     HIGH WIND/BLIZZARD            
##  [13] HIGH WIND AND HIGH TIDES       HIGH WIND/BLIZZARD/FREEZING RA
##  [15] HIGH WIND AND HEAVY SNOW       RECORD COLD AND HIGH WIND     
##  [17] HIGH WIND/ BLIZZARD            BLIZZARD/HIGH WIND            
##  [19] HIGH WIND/LOW WIND CHILL       HIGH WINDS AND WIND CHILL     
##  [21] HEAVY SNOW/HIGH WINDS/FREEZING WIND CHILL/HIGH WIND          
##  [23] HIGH WIND/WIND CHILL/BLIZZARD  HIGH WIND/WIND CHILL          
##  [25] HIGH WIND/HEAVY SNOW           HIGH WIND/SEAS                
##  [27] SEVERE TURBULENCE              HEAVY SNOW/WIND               
##  [29] WIND DAMAGE                    THUNDERSTORM WINDS/FUNNEL CLOU
##  [31] WINTER STORM/HIGH WIND         WINTER STORM/HIGH WINDS       
##  [33] GUSTY WINDS                    STRONG WINDS                  
##  [35] SNOW AND WIND                  HIGH WINDS DUST STORM         
##  [37] WINTER STORM HIGH WINDS        SEVERE THUNDERSTORM WINDS     
##  [39] THUNDERSTORMS WINDS            WINDS                         
##  [41] FLASH FLOOD WINDS              STRONG WIND                   
##  [43] HIGH WIND DAMAGE               DOWNBURST WINDS               
##  [45] DRY MICROBURST WINDS           DRY MIRCOBURST WINDS          
##  [47] MICROBURST WINDS               HIGH WINDS 57                 
##  [49] HIGH WINDS 66                  HIGH WINDS 76                 
##  [51] HIGH WINDS 63                  HIGH WINDS 67                 
##  [53] HEAVY SNOW/HIGH WINDS          HIGH WINDS 82                 
##  [55] HIGH WINDS 80                  HIGH WINDS 58                 
##  [57] LIGHTNING THUNDERSTORM WINDSS  HIGH WINDS 73                 
##  [59] HIGH WINDS 55                  THUNDERSTORM WINDS 60         
##  [61] THUNDERSTORM WINDSS            HIGH WINDS/FLOODING           
##  [63] TORNADOES, TSTM WIND, HAIL     LIGHTNING THUNDERSTORM WINDS  
##  [65] THUNDERSTORM WINDS53           THUNDERSTORM WINDS 13         
##  [67] HEAVY SNOW/HIGH WIND           HIGH WINDS/                   
##  [69] EXTREME WIND CHILLS            HIGH  WINDS                   
##  [71] EXTREME WIND CHILL             GRADIENT WINDS                
##  [73] THUNDERSTORM WINDS URBAN FLOOD THUNDERSTORM WINDS SMALL STREA
##  [75] BLOWING SNOW- EXTREME WIND CHI SNOW- HIGH WIND- WIND CHILL   
##  [77] THUNDERSTORM WINDS 2           TSTM WIND 51                  
##  [79] TSTM WIND 50                   TSTM WIND 52                  
##  [81] TSTM WIND 55                   THUNDERSTORM WINDS 61         
##  [83] THUNDERTORM WINDS              HAIL/WINDS                    
##  [85] WIND STORM                     HAIL/WIND                     
##  [87] WIND/HAIL                      THUNDERSTORMS WIND            
##  [89] THUNDERSTORM  WINDS            TUNDERSTORM WIND              
##  [91] THUNDERTSORM WIND              THUNDERSTORM WINDS/ HAIL      
##  [93] THUNDERSTORM WIND/LIGHTNING    THUNDESTORM WINDS             
##  [95] HIGH WIND 63                   HIGH WINDS/COASTAL FLOOD      
##  [97] THUNDERSTORM WIND G50          THUNDERSTROM WINDS            
##  [99] THUNDERSTORM WINDS      LE CEN BLIZZARD AND EXTREME WIND CHIL
## [101] LOW WIND CHILL                 BLOWING SNOW & EXTREME WIND CH
## [103] THUNDERSTORM WINDS G           DUST STORM/HIGH WINDS         
## [105] THUNDERSTORM WIND G60          THUNDERSTORM WINDS.           
## [107] THUNDERSTORM WIND G55          THUNDERSTORM WINDS G60        
## [109] THUNDERSTORM WINDS FUNNEL CLOU THUNDERSTORM WINDS 62         
## [111] HEAVY SNOW AND HIGH WINDS      HEAVY SNOW/HIGH WINDS & FLOOD 
## [113] THUNDERSTORM WINDS/FLASH FLOOD HIGH WIND 70                  
## [115] THUNDERSTORM WINDS 53          THUNDERSTORM WIND 59          
## [117] THUNDERSTORM WIND 52           THUNDERSTORM WIND 69          
## [119] LIGHTNING AND WINDS            TSTM WIND G58                 
## [121] THUNDERSTORMW WINDS            THUNDERSTORM WIND 60 MPH      
## [123] THUNDERSTORM WIND 65MPH        THUNDERSTORM WIND/ TREES      
## [125] THUNDERSTORM WIND/AWNING       THUNDERSTORM WIND 98 MPH      
## [127] THUNDERSTORM WIND TREES        THUNDERSTORM WIND 59 MPH      
## [129] THUNDERSTORM WINDS 63 MPH      THUNDERSTORM WIND/ TREE       
## [131] THUNDERSTORM WIND 65 MPH       THUNDERSTORM WIND.            
## [133] THUNDERSTORM WIND 59 MPH.      THUNDERSTORM WINDSHAIL        
## [135] THUDERSTORM WINDS              STORM FORCE WINDS             
## [137] THUNDERSTORM WINDS AND         TSTM WIND DAMAGE              
## [139] THUNDERSTORM WINDS 50          THUNDERSTORM WIND G52         
## [141] THUNDERSTORM WINDS 52          THUNDERSTORM WIND G51         
## [143] THUNDERSTORM WIND G61          THUNDERESTORM WINDS           
## [145] THUNDERSTORM WINDS/FLOODING    THUNDEERSTORM WINDS           
## [147] THUNDERSTORM WIND 50           THUNERSTORM WINDS             
## [149] HIGH WINDS/COLD                COLD/WINDS                    
## [151] THUNDERSTORM WIND 56           ICE/STRONG WINDS              
## [153] EXTREME WIND CHILL/BLOWING SNO SNOW/HIGH WINDS               
## [155] HIGH WINDS/SNOW                HEAVY SNOW AND STRONG WINDS   
## [157] BLOWING SNOW/EXTREME WIND CHIL THUNDERSTORM WIND/HAIL        
## [159] TSTM WINDS                     TSTM WIND 65)                 
## [161] THUNDERSTORM WINDS/ FLOOD      HIGH WIND AND SEAS            
## [163] THUNDERSTORMWINDS              THUNDERSTROM WIND             
## [165] HIGH WIND 48                   EXTREME WINDCHILL             
## [167] TSTM WIND/HAIL                 High Wind                     
## [169] Tstm Wind                      Wind                          
## [171] Wind Damage                    Strong Wind                   
## [173] Thunderstorm Wind              Strong Winds                  
## [175] Strong winds                   Whirlwind                     
## [177] Gusty Wind                     Gradient wind                 
## [179] TSTM WIND (G45)                Gusty Winds                   
## [181] GUSTY WIND                     TSTM WIND 40                  
## [183] TSTM WIND 45                   TSTM WIND (41)                
## [185] TSTM WIND (G40)                TSTM WND                      
## [187]  TSTM WIND                     STRONG WIND GUST              
## [189] Gusty winds                    GRADIENT WIND                 
## [191] Flood/Strong Wind              TSTM WIND AND LIGHTNING       
## [193] gradient wind                  Heavy surf and wind           
## [195]  TSTM WIND (G45)               TSTM WIND  (G45)              
## [197] HIGH WIND (G40)                TSTM WIND (G35)               
## [199] WAKE LOW WIND                  COLD WIND CHILL TEMPERATURES  
## [201] BITTER WIND CHILL              BITTER WIND CHILL TEMPERATURES
## [203] WIND ADVISORY                  GUSTY WIND/HAIL               
## [205] EXTREME WINDCHILL TEMPERATURES WIND AND WAVE                 
## [207]  WIND                          TSTM WIND G45                 
## [209] NON-SEVERE WIND DAMAGE         THUNDERSTORM WIND (G40)       
## [211] WIND GUSTS                     GUSTY LAKE WIND               
## [213] WND                            NON-TSTM WIND                 
## [215] NON TSTM WIND                  GUSTY THUNDERSTORM WINDS      
## [217] MARINE TSTM WIND               WHIRLWIND                     
## [219] EXTREME COLD/WIND CHILL        GUSTY THUNDERSTORM WIND       
## [221] COLD/WIND CHILL                MARINE HIGH WIND              
## [223] MARINE THUNDERSTORM WIND       MARINE STRONG WIND            
## 224 Levels:  TSTM WIND  TSTM WIND (G45)  WIND ... WND
unique(droplevels(stormdata[stormdata$catType == "HAIL",]$EVTYPE))
##  [1] HAIL              HAIL 1.75)        HAIL STORM       
##  [4] HAIL 75           SMALL HAIL        HAIL 80          
##  [7] FUNNEL CLOUD/HAIL HAIL 0.75         HAIL 1.00        
## [10] HAIL 1.75         HAIL 225          HAIL 0.88        
## [13] DEEP HAIL         HAIL 88           HAIL 175         
## [16] HAIL 100          HAIL 150          HAIL 075         
## [19] HAIL 125          HAIL 200          HAIL DAMAGE      
## [22] THUNDERSTORM HAIL HAIL 088          HAIL/ICY ROADS   
## [25] HAIL ALOFT        HAIL 275          HAIL 450         
## [28] HAILSTORM         HAILSTORMS        small hail       
## [31] Hail(0.75)        Small Hail        LATE SEASON HAIL 
## [34] NON SEVERE HAIL   MARINE HAIL      
## 35 Levels: DEEP HAIL FUNNEL CLOUD/HAIL HAIL HAIL 0.75 ... THUNDERSTORM HAIL
unique(droplevels(stormdata[stormdata$catType == "FLOOD",]$EVTYPE))
##  [1] ICE STORM/FLASH FLOOD          FLASH FLOOD                   
##  [3] FLASH FLOODING                 FLOODING                      
##  [5] FLOOD                          FLASH FLOODING/THUNDERSTORM WI
##  [7] BREAKUP FLOODING               RIVER FLOOD                   
##  [9] COASTAL FLOOD                  FLOOD WATCH/                  
## [11] FLASH FLOODS                   HEAVY SURF COASTAL FLOODING   
## [13] URBAN FLOODING                 URBAN/SMALL FLOODING          
## [15] LOCAL FLOOD                    FLOOD/FLASH FLOOD             
## [17] URBAN/SMALL STREAM FLOODING    STREAM FLOODING               
## [19] FLASH FLOOD/                   SMALL STREAM URBAN FLOOD      
## [21] URBAN FLOOD                    COASTAL FLOODING              
## [23] URBAN/SMALL STREAM FLOOD       MINOR FLOODING                
## [25] URBAN/SMALL STREAM  FLOOD      URBAN AND SMALL STREAM FLOOD  
## [27] SMALL STREAM FLOODING          FLOODS                        
## [29] SMALL STREAM AND URBAN FLOODIN SMALL STREAM/URBAN FLOOD      
## [31] SMALL STREAM AND URBAN FLOOD   RURAL FLOOD                   
## [33] MAJOR FLOOD                    ICE JAM FLOODING              
## [35] STREET FLOOD                   SMALL STREAM FLOOD            
## [37] LAKE FLOOD                     URBAN AND SMALL STREAM FLOODIN
## [39] RIVER AND STREAM FLOOD         MINOR FLOOD                   
## [41] RIVER FLOODING                 FLOOD/RIVER FLOOD             
## [43] MUD SLIDES URBAN FLOODING      HAIL FLOODING                 
## [45] LOCAL FLASH FLOOD              FLOOD/FLASH FLOODING          
## [47] COASTAL/TIDAL FLOOD            FLASH FLOOD/FLOOD             
## [49] FLASH FLOOD FROM ICE JAMS      FLASH FLOOD/ STREET           
## [51] FLOOD FLASH                    FLOOD FLOOD/FLASH             
## [53] TIDAL FLOOD                    FLOOD/FLASH                   
## [55] HIGHWAY FLOODING               FLASH FLOOD/ FLOOD            
## [57] BEACH EROSION/COASTAL FLOOD    RAPIDLY RISING WATER          
## [59] SNOWMELT FLOODING              FLASH FLOODING/FLOOD          
## [61] FLASH FLOOODING                BEACH FLOOD                   
## [63] FLOOD/FLASHFLOOD               URBAN SMALL STREAM FLOOD      
## [65] URBAN FLOOD LANDSLIDE          URBAN FLOODS                  
## [67] FLASH FLOOD/LANDSLIDE          LANDSLIDE/URBAN FLOOD         
## [69] FLASH FLOOD LANDSLIDES         URBAN/SML STREAM FLD          
## [71] Minor Flooding                 Ice jam flood (minor          
## [73] Coastal Flooding               COASTALFLOOD                  
## [75] Erosion/Cstl Flood             Tidal Flooding                
## [77] River Flooding                 Flood/Flash Flood             
## [79] STREET FLOODING                Flood                         
## [81] TIDAL FLOODING                  COASTAL FLOOD                
## [83] Urban Flooding                 Urban flood                   
## [85] Urban Flood                    Coastal Flood                 
## [87] coastal flooding               Sml Stream Fld                
## [89] URBAN/SML STREAM FLDG          URBAN/SMALL STRM FLDG         
## [91] COASTAL FLOODING/EROSION       URBAN/STREET FLOODING         
## [93] COASTAL  FLOODING/EROSION      FLOOD/FLASH/FLOOD             
## [95]  FLASH FLOOD                   CSTL FLOODING/EROSION         
## [97] LAKESHORE FLOOD               
## 97 Levels:  COASTAL FLOOD  FLASH FLOOD ... URBAN/STREET FLOODING

As you can see, there were some decisions about the order of the pattern matching, that would classify THUNDERSTORM WIND as WIND and not THUNDERSTORM. But with just a handful of cases, these appear to be clean categorizations.

Human and Economic Impacts

The impacts we want to study are to the population health and economic consequences. The NOAA dataset provides the following data attributes

  • FATALITIES number of attributed fatalaties
  • INJURIES number of attributed injuries
  • PROPDMG and PROPDMGEXP provides estimates for property damage
  • CROPDMG and CROPDMGEXP provides estimates for crop damage

Details of how these estimates are calculated are covered in the Data Document, along with guidelines for what injuries/fatalies should be attributed to the storm event as opposed to some incidental event (e.g. a traffic accident during a flood evacuation). There is also an appendix that provides guidelines for estimating the dollar amount of damage.

The financial attributes will need to be combined in order to get an estimated dollar amount

summary(as.factor(toupper(stormdata$CROPDMGEXP)))
##             ?      0      2      B      K      M 
## 618413      7     19      1      9 281853   1995
summary(as.factor(toupper(stormdata$PROPDMGEXP)))
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      H      K      M 
##      4      5      1     40      7 424665  11337

Most of it is either blank (we will have to ensure the corresponding damage estimates are blank are appropriately scaled > $1000); one of the known (B) billon, (M) million, or (K) thousand units. Lets go ahead and calculate a normalized damage impact for the 3 known magnitudes.

stormdata$CROPDMGEXP <- toupper(stormdata$CROPDMGEXP)
stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)

# make sure everything is in line
summary(stormdata[stormdata$PROPDMGEXP == "K",]$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.90   25.28   10.00 5000.00
summary(stormdata[stormdata$PROPDMGEXP == "M",]$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.50    2.50   12.41    6.00  929.00
summary(stormdata[stormdata$PROPDMGEXP == "B",]$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   1.450   2.300   6.896   5.000 115.000
summary(stormdata[stormdata$CROPDMGEXP == "K",]$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.766   0.000 990.000
summary(stormdata[stormdata$CROPDMGEXP == "M",]$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00   17.12    9.00  596.00
summary(stormdata[stormdata$CROPDMGEXP == "B",]$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.200   0.500   1.512   1.510   5.000
# the "K" in PROPDMG had an outlier over 1000
stormdata[stormdata$PROPDMGEXP == "K" & stormdata$PROPDMG > 999,c("PROPDMG", "EVTYPE")]
##        PROPDMG            EVTYPE
## 749050    3000         HIGH WIND
## 750967    4410           TORNADO
## 755020    3000             FLOOD
## 762429    3500 THUNDERSTORM WIND
## 778568    5000 THUNDERSTORM WIND
## 789445    1000     COASTAL FLOOD
## 791403    4800         LANDSLIDE
## 808182    5000       FLASH FLOOD
## 808183    5000       FLASH FLOOD
## 811973    1000         LIGHTNING
## 815064    3200 THUNDERSTORM WIND
## 815187    1000           TORNADO
## 821163    1000       FLASH FLOOD
## 823219    1000       FLASH FLOOD
## 823303    1000       FLASH FLOOD
## 844264    1000       STRONG WIND
## 848518    1000          WILDFIRE
## 848601    1584         LANDSLIDE
## 878681    1000             FLOOD
## 881000    3000       FLASH FLOOD
## 900685    5000        WATERSPOUT
## 900997    1000           TORNADO

Taking a look, those seem like they were trying to estimate damage appropriately, for instance the record 749050 with 3000 in Property damage is reasonable to assume that is $3 million.

stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "K", 1000 * stormdata$PROPDMG, NA) 
stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "M", 1000000 * stormdata$PROPDMG, stormdata$propDamage)
stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "B", 1000000000 * stormdata$PROPDMG, stormdata$propDamage)
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "K", 1000 * stormdata$CROPDMG, NA) 
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "M", 1000000 * stormdata$CROPDMG, stormdata$cropDamage)
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "B", 1000000000 * stormdata$CROPDMG, stormdata$cropDamage)

An approximation is made with the remaining damage estimates

# indices where we have a positive value in the raw data but no tidy estimate
noTidyCropDmg <- !is.na(stormdata$CROPDMG) & is.na(stormdata$cropDamage) & (stormdata$CROPDMG > 0)

# average crop damage by type for the missing estimates 
group_by(stormdata, catType) %>% 
    summarize(meanCROPDMG=mean(CROPDMG, na.rm=TRUE), 
              meanCropDamage=mean(cropDamage, na.rm=TRUE),
              count=n(),
              countWithDamage=sum(!is.na(cropDamage))) %>% 
    filter(!is.nan(meanCropDamage)) %>% 
    subset(catType %in% unique(stormdata[noTidyCropDmg,]$catType))
## # A tibble: 3 x 5
##   catType meanCROPDMG meanCropDamage  count countWithDamage
##    <fctr>       <dbl>          <dbl>  <int>           <int>
## 1    WIND    0.613399       18532.22 364880          110429
## 2    HAIL    2.010082       36919.44 289276           82528
## 3 TORNADO    1.648294       43238.65  60685            9597
# since crop damage is all in thousands use that factor
stormdata$cropDamage <- ifelse(is.na(stormdata$cropDamage) & (stormdata$CROPDMG > 0), 1000 * stormdata$CROPDMG, stormdata$cropDamage) 

# and check the means again
group_by(stormdata, catType) %>% 
    summarize(meanCROPDMG=mean(CROPDMG, na.rm=TRUE), 
              meanCropDamage=mean(cropDamage, na.rm=TRUE),
              count=n(),
              countWithDamage=sum(!is.na(cropDamage))) %>% 
    filter(!is.nan(meanCropDamage)) %>% 
    subset(catType %in% unique(stormdata[noTidyCropDmg,]$catType))
## # A tibble: 3 x 5
##   catType meanCROPDMG meanCropDamage  count countWithDamage
##    <fctr>       <dbl>          <dbl>  <int>           <int>
## 1    WIND    0.613399       18531.34 364880          110439
## 2    HAIL    2.010082       36918.82 289276           82530
## 3 TORNADO    1.648294       43241.80  60685            9600
# Look at property damage
noTidyPropDmg <- !is.na(stormdata$PROPDMG) & is.na(stormdata$propDamage) & (stormdata$PROPDMG > 0)


# and average prop damage by type for the missing estimates 
group_by(stormdata, catType) %>% 
    summarize(meanPROPDMG=mean(PROPDMG, na.rm=TRUE), 
              meanPropDamage=mean(propDamage, na.rm=TRUE),
              count=n(),
              countWithDamage=sum(!is.na(propDamage))) %>% 
    filter(!is.nan(meanPropDamage)) %>% 
    subset(catType %in% unique(stormdata[noTidyPropDmg,]$catType))
## # A tibble: 9 x 5
##     catType meanPROPDMG meanPropDamage  count countWithDamage
##      <fctr>       <dbl>          <dbl>  <int>           <int>
## 1      WIND    8.590876       97062.24 364880          182719
## 2      HAIL    2.382900      173271.23 289276           92191
## 3     FLOOD   28.582123     3176728.08  86091           52748
## 4   TORNADO   52.970789     1098395.71  60685           51841
## 5      SNOW    8.087139      347295.32  37327           22266
## 6 LIGHTNING   38.278994       84650.38  15763           10971
## 7      RAIN    4.973675      548224.96  12463            5980
## 8      HEAT    2.395225       71554.78   4109            1821
## 9       ICE   35.018146     2898129.95   2131            1366
# flood, hail, ice, tornado all in the millions
stormdata$propDamage <- ifelse(is.na(stormdata$propDamage) &
                                   (stormdata$catType %in% c("FLOOD","HAIL","ICE","TORNADO")) &
                                   (stormdata$PROPDMG > 0), 
                               1000000 * stormdata$PROPDMG, stormdata$propDamage) 
# everything else at 1K
stormdata$propDamage <- ifelse(is.na(stormdata$propDamage) &
                                   (stormdata$PROPDMG > 0), 
                               1000 * stormdata$PROPDMG, stormdata$propDamage) 


# confirm we cleared it all up
# Look at property damage
sum(!is.na(stormdata$PROPDMG) & is.na(stormdata$propDamage) & (stormdata$PROPDMG > 0))
## [1] 0
# and recheck the means
group_by(stormdata, catType) %>% 
    summarize(meanPROPDMG=mean(PROPDMG, na.rm=TRUE), 
              meanPropDamage=mean(propDamage, na.rm=TRUE),
              count=n(),
              countWithDamage=sum(!is.na(propDamage))) %>% 
    filter(!is.nan(meanPropDamage)) %>% 
    subset(catType %in% unique(stormdata[noTidyPropDmg,]$catType))
## # A tibble: 9 x 5
##     catType meanPROPDMG meanPropDamage  count countWithDamage
##      <fctr>       <dbl>          <dbl>  <int>           <int>
## 1      WIND    8.590876       96977.78 364880          182946
## 2      HAIL    2.382900      176787.45 289276           92211
## 3     FLOOD   28.582123     3185621.93  86091           52792
## 4   TORNADO   52.970789     1103902.03  60685           51853
## 5      SNOW    8.087139      347249.10  37327           22269
## 6 LIGHTNING   38.278994       84526.89  15763           10989
## 7      RAIN    4.973675      548133.64  12463            5981
## 8      HEAT    2.395225       71597.83   4109            1822
## 9       ICE   35.018146     2932586.33   2131            1367
# and set up the data frames to drive the analysis
# combines them all all once, and add totals for casulaties and damage
resultN <- stormdata %>% group_by(catType) %>%
    summarize(withInjury=sum(INJURIES > 0, na.rm=TRUE), 
              withFatality=sum(FATALITIES >0, na.rm=TRUE),
              withPropDamage=sum(propDamage > 0, na.rm=TRUE),
              withCropDamage=sum(cropDamage > 0, na.rm=TRUE),
              sumInjury=sum(INJURIES, na.rm=TRUE), 
              sumFatality=sum(FATALITIES, na.rm=TRUE),
              propDamage=sum(propDamage, na.rm=TRUE), 
              cropDamage=sum(cropDamage, na.rm=TRUE),
              count=n()) %>%
    mutate(casualties = sumInjury + sumFatality) %>%
    mutate(damage = propDamage + cropDamage) %>%
    mutate(perInjury = withInjury/count) %>%
    mutate(perFatality = withFatality/count) %>%
    mutate(perPropDamage = withPropDamage/count) %>%
    mutate(perCropDamage = withCropDamage/count)




# top10 results by casualites and damage (and combined top impacts)
result10cas <- head(resultN[order(resultN$casualties,decreasing=TRUE),], 10)
result10dmg <- head(resultN[order(resultN$damage,decreasing=TRUE),], 10)
combinedType <- intersect(result10cas$catType, result10dmg$catType)
result10 <- resultN[resultN$catType %in% combinedType,]

This chart outlines the top events impacts to total casualities (fatalities and injuries) and total estimated damage. As you can see the FLOOD events by far cause the most damage, and the TORNADO events by far cause the most impact to human health.

ggplot(result10, aes(x=damage, y=casualties)) +
    geom_point(aes(col=catType, size=sumFatality)) +
    labs(title="Damage Vs Casualties", 
         x="Damage", 
         y="Casualties",
         color="Storm Type",
         size="Fatalities") +
    scale_x_continuous(labels=scales::unit_format("$B", 1e-9)) + 
    guides(color=guide_legend(override.aes=(list(size=4)))) +
    theme_bw()

These dataframes will be used for more detailed exploration on the impacts

annual <- stormdata %>% 
    mutate(year = as.Date(format(date,"%Y/01/01"))) %>% 
    group_by(year, catType) %>% 
    summarize(propDamage=sum(propDamage, na.rm=TRUE), cropDamage=sum(cropDamage, na.rm=TRUE), 
              INJURIES=sum(INJURIES, na.rm=TRUE), FATALITIES=sum(FATALITIES, na.rm=TRUE)) %>%
    mutate(damage=propDamage + cropDamage, casualties = INJURIES + FATALITIES) %>%
    mutate(isTornado=(catType == "TORNADO"))


annualTotal <- group_by(annual, year) %>% 
    summarize(propDamage=sum(propDamage, na.rm=TRUE), cropDamage=sum(cropDamage, na.rm=TRUE), 
              INJURIES=sum(INJURIES, na.rm=TRUE), FATALITIES=sum(FATALITIES, na.rm=TRUE),
              damage=sum(damage), casualties=sum(casualties))

annualTotal2 <- group_by(annual, year, isTornado) %>% 
    summarize(propDamage=sum(propDamage, na.rm=TRUE), 
              cropDamage=sum(cropDamage, na.rm=TRUE), 
              INJURIES=sum(INJURIES, na.rm=TRUE), 
              FATALITIES=sum(FATALITIES, na.rm=TRUE),
              damage=sum(damage), 
              casualties=sum(casualties))

summary(annualTotal)
##       year              propDamage          cropDamage       
##  Min.   :1950-01-01   Min.   :3.448e+07   Min.   :0.000e+00  
##  1st Qu.:1965-04-02   1st Qu.:2.041e+08   1st Qu.:0.000e+00  
##  Median :1980-07-02   Median :1.080e+09   Median :0.000e+00  
##  Mean   :1980-07-02   Mean   :6.913e+09   Mean   :7.920e+08  
##  3rd Qu.:1995-10-01   3rd Qu.:5.523e+09   3rd Qu.:1.207e+09  
##  Max.   :2011-01-01   Max.   :1.219e+11   Max.   :5.807e+09  
##     INJURIES       FATALITIES          damage            casualties   
##  Min.   :  524   Min.   :  24.00   Min.   :3.448e+07   Min.   :  558  
##  1st Qu.: 1044   1st Qu.:  55.75   1st Qu.:2.041e+08   1st Qu.: 1098  
##  Median : 1790   Median :  96.50   Median :1.080e+09   Median : 1864  
##  Mean   : 2267   Mean   : 244.27   Mean   :7.705e+09   Mean   : 2511  
##  3rd Qu.: 2722   3rd Qu.: 408.25   3rd Qu.:7.852e+09   3rd Qu.: 3167  
##  Max.   :11177   Max.   :1491.00   Max.   :1.255e+11   Max.   :11864

Human Impact

The Casualty compared to Damage graphic shows that Tornados and Floods are the biggest impacts to the human population.

This shows how likely an event will result in injury an is in the most frequent event categories and the magnitude of injuries

ggplot(result10, aes(catType)) +
    geom_point(aes(y = perInjury, size=sumInjury, col="Injury")) + 
    geom_point(aes(y = perFatality, size=sumFatality, col="Fatality")) + 
    labs(title="Liklihood and Magnitude of Injuries", 
         x="Event Type", 
         y="Liklihood (%)",
         size="Casualities",
         col="Type of Impact") +
    coord_flip() + 
    theme_bw()

This shows us that over 10% of Tornado result in some injuries, in fact these are the types of events that cause the most injuries.

Economic Impact

Using a similar approach to the human impact we can see the liklihood of property damage and the magnitude of the damage based on the most frequently occuring events.

ggplot(result10, aes(catType)) +
    geom_point(aes(y = perCropDamage, size=cropDamage, col="Crop")) + 
    geom_point(aes(y = perPropDamage, size=propDamage, col="Property")) + 
    labs(title="Liklihood and Magnitude of Crop Damage", 
         x="Event Type", 
         y="Liklihood (%)",
         size="Damage ($)",
         col="Type") +
    scale_size_continuous(labels=comma) + 
    scale_y_continuous(labels=percent) +
    coord_flip() + 
    theme_bw()

Across the board we can see that Hurricane cause the events that are expected to most likley cause damage, however floods cause the most in terms of dollar costs to both property and crops.

Results

The results of this analysis on the data set show that Tornado events are the cause of most injuries, in addition if an individual is in a Tornado, it is more likely to be injured than in the other commonly occuring events.

Flood events cause the most damage, however it is more likely a Hurricane or Tornado would result in property damage, when the Flood does occur the damage can be signciantly more.