Synopsis

This study focused on identifying which weather events have the greatest impact on 1.population health 2.property damage 3.crop damage. The data was obtained from the NOAA Storm Database which recorded injuries, fatalities, property and crop damage values resulting from a set of categorized weather events. We looked at data starting from the year 1996 till Nov 2011.

Data Processing

The following libraries were first included.

library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)

The data was read from a csv file and the relevant columns were selected.

myData <- read.csv("repdata%2Fdata%2FStormData.csv",
                   header = TRUE,
                   stringsAsFactors = FALSE)
myDataSel <- select(myData, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
                    PROPDMGEXP, CROPDMG, CROPDMGEXP)  

Filtering and cleanup of the data was done in the following sequence:

  1. Taking only data from Jan 1996 onward.

    • Based on a pinned discussion forum post by Usama A.F. Khalil (Mentor), all weather event types were only recorded from Jan 1996 onward.

    • As the time-resolution of the analysis was coarse, no accounting for the various timezones in the data was done. All were simply assumed to be a single time zone.

    • Only begin dates were used.

myDataSel$BGN_DATE <- mdy_hms(myDataSel$BGN_DATE)
myDataSel <- filter(myDataSel, 
                    as.numeric(BGN_DATE - mdy_hms("01/01/1996 00:00:00")) > 0)
  1. Removing extra whitespaces and setting characters to uppercase.

    • Any repetition of a whitespace was reduced to a single character.

    • Trailing and preceding whitespaces were removed.

    • All alphabetic characters were set to uppercase.

myDataSel$EVTYPE <- gsub(pattern = "( )+", replacement = " ", myDataSel$EVTYPE) %>%
                    str_trim(side = "both") %>%
                    toupper()
  1. Manual reclassification of certain event types.

    • The documentation for the dataset has a list of 48 event types. However, the event types within the dataset appeared to be significantly more than that. These 48 event types served as a reference for regrouping some of the event types in the data at this stage, though the end result was still more than the reference 48.
myDataSel$EVTYPE[myDataSel$EVTYPE == "COASTAL FLOODING"]  <- "COASTAL FLOOD"
myDataSel$EVTYPE[myDataSel$EVTYPE == "COLD"]  <- "COLD/WIND CHILL"
myDataSel$EVTYPE[myDataSel$EVTYPE == "EXTREME WINDCHILL" | myDataSel$EVTYPE == "EXTREME COLD"] <- "EXTREME COLD/WIND CHILL"
myDataSel$EVTYPE[myDataSel$EVTYPE == "FROST" | myDataSel$EVTYPE == "FREEZE"] <- "FROST/FREEZE"
myDataSel$EVTYPE[myDataSel$EVTYPE == "HEAVY SURF" | myDataSel$EVTYPE == "HEAVY SURF/HIGH SURF"] <- "HIGH SURF"
myDataSel$EVTYPE[myDataSel$EVTYPE == "HURRICANE"] <- "HURRICANE/TYPHOON"
myDataSel$EVTYPE[myDataSel$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
myDataSel$EVTYPE[myDataSel$EVTYPE == "STORM SURGE"] <- "STORM SURGE/TIDE"
myDataSel$EVTYPE[myDataSel$EVTYPE == "STRONG WINDS"] <- "STRONG WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "TSTM WIND" | myDataSel$EVTYPE == "TSTM WIND (G45)"] <- "THUNDERSTORM WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "WINTER WEATHER" | myDataSel$EVTYPE == "WINTRY MIX"] <- "WINTER WEATHER/MIX"
  1. Some event types in the data that looked vague or looked like some regular reporting were removed.
ind <- myDataSel$EVTYPE == "OTHER" | myDataSel$EVTYPE == "MONTHLY PRECIPITATION" 
myDataSel <- myDataSel[!ind, ]
  1. Setting a minimum threshold of event type appearances.

    • An arbritrary quantity of 30 was used.
thresholdEVTypeCount <- 30
uniqueEVTypeCount <- table(myDataSel$EVTYPE)
evTypeSel <- names(uniqueEVTypeCount[uniqueEVTypeCount > thresholdEVTypeCount])
myDataSel <- filter(myDataSel, EVTYPE %in% evTypeSel)

At this stage, there were the following event types remaining:

sort(unique(myDataSel$EVTYPE))
##  [1] "ASTRONOMICAL HIGH TIDE"   "ASTRONOMICAL LOW TIDE"   
##  [3] "AVALANCHE"                "BLIZZARD"                
##  [5] "COASTAL FLOOD"            "COLD/WIND CHILL"         
##  [7] "DENSE FOG"                "DROUGHT"                 
##  [9] "DRY MICROBURST"           "DUST DEVIL"              
## [11] "DUST STORM"               "EXCESSIVE HEAT"          
## [13] "EXTREME COLD/WIND CHILL"  "FLASH FLOOD"             
## [15] "FLOOD"                    "FOG"                     
## [17] "FREEZING FOG"             "FREEZING RAIN"           
## [19] "FROST/FREEZE"             "FUNNEL CLOUD"            
## [21] "GLAZE"                    "GUSTY WINDS"             
## [23] "HAIL"                     "HEAT"                    
## [25] "HEAVY RAIN"               "HEAVY SNOW"              
## [27] "HIGH SURF"                "HIGH WIND"               
## [29] "HURRICANE/TYPHOON"        "ICE STORM"               
## [31] "LAKE-EFFECT SNOW"         "LANDSLIDE"               
## [33] "LIGHT SNOW"               "LIGHTNING"               
## [35] "MARINE HAIL"              "MARINE HIGH WIND"        
## [37] "MARINE STRONG WIND"       "MARINE THUNDERSTORM WIND"
## [39] "MIXED PRECIPITATION"      "MODERATE SNOWFALL"       
## [41] "RECORD COLD"              "RECORD HEAT"             
## [43] "RECORD WARMTH"            "RIP CURRENT"             
## [45] "RIVER FLOOD"              "SLEET"                   
## [47] "SMALL HAIL"               "SNOW"                    
## [49] "STORM SURGE/TIDE"         "STRONG WIND"             
## [51] "TEMPERATURE RECORD"       "THUNDERSTORM WIND"       
## [53] "TORNADO"                  "TROPICAL DEPRESSION"     
## [55] "TROPICAL STORM"           "TSTM WIND/HAIL"          
## [57] "UNSEASONABLY DRY"         "UNSEASONABLY WARM"       
## [59] "URBAN/SML STREAM FLD"     "WATERSPOUT"              
## [61] "WILD/FOREST FIRE"         "WILDFIRE"                
## [63] "WIND"                     "WINTER STORM"            
## [65] "WINTER WEATHER/MIX"

We proceeded to transform the data into more useful values:

  1. Convert the event types into factors.

  2. Convert the property damage value currently in mantissa+exponent form into a single variable.

  3. Convert the crop damage value currently in mantissa+exponent form into a single variable.

  4. Add the fatalities and injuries together into a casulty count.

  5. Obtain a year value from the begin date.

# Set the evtype as factor
myDataSel$EVTYPE <- as.factor(myDataSel$EVTYPE)

# Convert the exponents to numbers
propExpNum <- sapply(myDataSel$PROPDMGEXP, function(a) {
                                            a <- toupper(a)
                                            if(a == "B") {9} 
                                            else if(a == "M") {6}
                                            else if(a == "K") {3}
                                            else if(a == "H") {2}
                                            else if(a == "") {0}
                                            else {as.numeric(a)}
                                        })
cropExpNum <- sapply(myDataSel$CROPDMGEXP, function(a) {
                                            a <- toupper(a)
                                            if(a == "B") {9} 
                                            else if(a == "M") {6}
                                            else if(a == "K") {3}
                                            else if(a == "H") {2}
                                            else if(a == "") {0}
                                            else {as.numeric(a)}
                                        })
myDataSel <- mutate(myDataSel, year = year(BGN_DATE),
                        casulties = FATALITIES + INJURIES,
                        cropDmgTotal = CROPDMG * 10 ^ cropExpNum,
                        propDmgTotal = PROPDMG * 10 ^ propExpNum)
myDataSel$PROPDMG <- NULL
myDataSel$PROPDMGEXP <- NULL
myDataSel$CROPDMG <- NULL
myDataSel$CROPDMGEXP <- NULL
myDataSel$FATALITIES <- NULL
myDataSel$INJURIES <- NULL

The final structure of the data frame is

str(myDataSel)
## 'data.frame':    652202 obs. of  6 variables:
##  $ BGN_DATE    : POSIXct, format: "1996-01-06" "1996-01-11" ...
##  $ EVTYPE      : Factor w/ 65 levels "ASTRONOMICAL HIGH TIDE",..: 64 53 52 52 52 23 28 52 52 52 ...
##  $ year        : num  1996 1996 1996 1996 1996 ...
##  $ casulties   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropDmgTotal: Named num  38000 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr  "K" "" "" "" ...
##  $ propDmgTotal: Named num  380000 100000 3000 5000 2000 0 400000 12000 8000 12000 ...
##   ..- attr(*, "names")= chr  "K" "K" "K" "K" ...

and a cursory check for NAs show all is OK.

any(is.na(myDataSel))
## [1] FALSE



Results

To determine which weather event types had the greatest metric values, we simply summed all the data of the same weather event type and did no averaging.

myDataSel <- group_by(myDataSel, EVTYPE)
totalCasulties <- sum(myDataSel$casulties)
totalPropDmg <- sum(myDataSel$propDmgTotal)
totalCropDmg <- sum(myDataSel$cropDmgTotal)

myDataSelSummary <- summarize(myDataSel, CASULTIES = sum(casulties),
                                CASULTY_PCT = sum(casulties) / totalCasulties * 100,
                                PROP_DMG = sum(propDmgTotal),
                                PROP_DMG_PCT = sum(propDmgTotal) / totalPropDmg * 100,
                                CROP_DMG = sum(cropDmgTotal),
                                CROP_DMG_PCT = sum(cropDmgTotal) / totalCropDmg * 100,
                                NUM_EVENTS = length(casulties))



1. Effect on Population Health

The following is the top 10 event types with the greatest casulty count:

arrData <- arrange(select(myDataSelSummary, EVTYPE, CASULTIES, CASULTY_PCT, NUM_EVENTS), desc(CASULTIES))
head(arrData, n = 10)
## # A tibble: 10 × 4
##               EVTYPE CASULTIES CASULTY_PCT NUM_EVENTS
##               <fctr>     <dbl>       <dbl>      <int>
## 1            TORNADO     22178   33.539001      23153
## 2     EXCESSIVE HEAT      8188   12.382421       1656
## 3              FLOOD      7172   10.845961      24248
## 4  THUNDERSTORM WIND      5403    8.170765     210112
## 5          LIGHTNING      4790    7.243747      13203
## 6        FLASH FLOOD      2561    3.872909      51000
## 7       WINTER STORM      1483    2.242688      11315
## 8               HEAT      1459    2.206394        716
## 9  HURRICANE/TYPHOON      1446    2.186734        258
## 10         HIGH WIND      1318    1.993165      19907

The Top 10 accounted for 84.68% of all the casulties in the selected dataset.

Focusing on the Top 5, a time series (by year) casulty-count plot was generated.

myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
    group_by(EVTYPE, year) %>%
    summarize(casultiesInYr = sum(casulties))
ggplot(myDataSel2, aes(year, casultiesInYr / 1e3, color = EVTYPE, shape = EVTYPE)) + 
        geom_point(size = 3) +
        geom_line() + 
        labs(x = "Year", y = "Casulties (K)")
Top 5 dangers to population health

Top 5 dangers to population health

Tornadoes appear to be the biggest perennial danger to population health, though a spike in the last year drove the casulty count way up. We did not continue to further analyze if this spike was due to a single event.

Excessive Heat, the No. 2 danger was typically above the rest of the event types as well.

Floods, the No. 3 danger appeared to have been driven up the rankings by the big spike in 1998. It appears to typically fall below the rest of the Top 5.

2. Effect on Propety Damage

The following is the Top 10 event types with the greatest property damage value:

arrData <- arrange(select(myDataSelSummary, EVTYPE, PROP_DMG, PROP_DMG_PCT, NUM_EVENTS), desc(PROP_DMG))
head(arrData, n = 10)
## # A tibble: 10 × 4
##               EVTYPE     PROP_DMG PROP_DMG_PCT NUM_EVENTS
##               <fctr>        <dbl>        <dbl>      <int>
## 1              FLOOD 143944833550    39.348014      24248
## 2  HURRICANE/TYPHOON  81118659010    22.174176        258
## 3   STORM SURGE/TIDE  47834724000    13.075852        401
## 4            TORNADO  24616905710     6.729150      23153
## 5        FLASH FLOOD  15222253910     4.161076      51000
## 6               HAIL  14595143420     3.989653     207714
## 7  THUNDERSTORM WIND   7869045380     2.151041     210112
## 8     TROPICAL STORM   7642475550     2.089108        682
## 9          HIGH WIND   5247860360     1.434528      19907
## 10          WILDFIRE   4758667000     1.300805       2732

The Top 10 accounted for 96.45% of all the property damage in the selected dataset.

Again, we look at a time series plot of the Top 5.

myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
    group_by(EVTYPE, year) %>%
    summarize(propDamageInYr = sum(propDmgTotal))
ggplot(myDataSel2, aes(year, propDamageInYr / 1e9, color = EVTYPE, shape = EVTYPE)) + 
    geom_point(size = 3) +
    geom_line() + 
    labs(x = "Year", y = "Property Damage (B$)")
Top 5 causes of property damage

Top 5 causes of property damage

There’s clearly a small number of huge spikes here. Looking at Floods and comparing to the summary table, we can see that the single spike of about $120 billion is a major contributor to its total damage value of $144 billion. The same can also be said about Hurricane/Typoon, which had 2 major spikes, and Storm Surge/Tide (1 major spike).

3. Effect on Crop Damage

The following is the Top 10 event types with the greatest crop damage value:

arrData <- arrange(select(myDataSelSummary, EVTYPE, CROP_DMG, CROP_DMG_PCT, NUM_EVENTS), desc(CROP_DMG))
head(arrData, n = 10)
## # A tibble: 10 × 4
##                     EVTYPE    CROP_DMG CROP_DMG_PCT NUM_EVENTS
##                     <fctr>       <dbl>        <dbl>      <int>
## 1                  DROUGHT 13367566000    38.675857       2433
## 2        HURRICANE/TYPHOON  5349282800    15.476871        258
## 3                    FLOOD  4974778400    14.393332      24248
## 4                     HAIL  2476029450     7.163799     207714
## 5              FLASH FLOOD  1334901700     3.862219      51000
## 6  EXTREME COLD/WIND CHILL  1326023000     3.836531       1823
## 7             FROST/FREEZE  1250911000     3.619212       1458
## 8        THUNDERSTORM WIND   952246350     2.755097     210112
## 9               HEAVY RAIN   728169800     2.106785      11528
## 10          TROPICAL STORM   677711000     1.960795        682

The Top 10 accounted for 93.85% of all the crop damage in the selected dataset.

The time series plot of the Top 5:

myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
    group_by(EVTYPE, year) %>%
    summarize(cropDamageInYr = sum(cropDmgTotal))
ggplot(myDataSel2, aes(year, cropDamageInYr / 1e9, color = EVTYPE, shape = EVTYPE)) + 
    geom_point(size = 3) +
    geom_line() + 
    labs(x = "Year", y = "Crop Damage (B$)")
Top 5 causes of crop damage

Top 5 causes of crop damage

The crop damage plot has no major spikes that are at extreme distance from the “noise floor” of the plot. Drought is above the noise floor for many years (about 10) and clearly represents the most major perennial problem. Hurricane/Typhoon is significantly above the noise floor for only about 4 years.