Analysis of National Oceanic and Atmospheric Administration’s storm database for population health and economic impact.

1. Synopsys

The U.S. National Oceanic and Atmospheric Administration (NOAA) maintains a database which that tracks major storm and weather events in the United States. This project attempts to analyze this database for impact on human health and economy due to these events. The database provides detailed information about the event, location and its impact on human life and economy.

It can be summarized that tornadoes and excessive heat have the greatest impact on population fatalities and injuries, while flood and hurrican/typhoon have the greatest impact on property and cost damage in the U.S. in the approximately 45 years that the data was collected.

2. Data Processing

This section discusses the journey of data from its source through to the final preparation before generating results.

2.1 Data Source

Database source URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

The data file pointed to by the URL was downloaded and unzipped on the local Mac (OSX 10.11.16) on 07/20/2017

library(dplyr)
library (ggplot2)
library(lattice)
library(grid)
library(gridExtra)

stormdbFile <- "repdata%2Fdata%2FStormData.csv"
rawDF <- read.csv(stormdbFile)
rawDF <- tbl_df(rawDF)

The csv data is properly delimited for reading and a heading row has variable names referenced in the NOAA codebook.

The data file is quite large with 900K observations for 37 variables beginning in the year 1966 and ending in 2011 as evidenced by the outputs below.

str(rawDF)

## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

tail(rawDF)

## # A tibble: 6 × 37
##   STATE__           BGN_DATE    BGN_TIME TIME_ZONE COUNTY
##     <dbl>             <fctr>      <fctr>    <fctr>  <dbl>
## 1      47 11/28/2011 0:00:00 03:00:00 PM       CST     21
## 2      56 11/30/2011 0:00:00 10:30:00 PM       MST      7
## 3      30 11/10/2011 0:00:00 02:48:00 PM       MST      9
## 4       2  11/8/2011 0:00:00 02:58:00 PM       AKS    213
## 5       2  11/9/2011 0:00:00 10:21:00 AM       AKS    202
## 6       1 11/28/2011 0:00:00 08:00:00 PM       CST      6
## # ... with 32 more variables: COUNTYNAME <fctr>, STATE <fctr>,
## #   EVTYPE <fctr>, BGN_RANGE <dbl>, BGN_AZI <fctr>, BGN_LOCATI <fctr>,
## #   END_DATE <fctr>, END_TIME <fctr>, COUNTY_END <dbl>, COUNTYENDN <lgl>,
## #   END_RANGE <dbl>, END_AZI <fctr>, END_LOCATI <fctr>, LENGTH <dbl>,
## #   WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>,
## #   PROPDMG <dbl>, PROPDMGEXP <fctr>, CROPDMG <dbl>, CROPDMGEXP <fctr>,
## #   WFO <fctr>, STATEOFFIC <fctr>, ZONENAMES <fctr>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <fctr>,
## #   REFNUM <dbl>

2.2 Data Selection

For analyzing population impact, this project simply uses EVTYPE, FATALITIES and INJURIES. FATALITIES and INJURIES are aggregated for all the severe weather events.

For analyzing economic impact, this project uses EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. The damage to property and crops is aggregated for all the weather events.

In either case, no attempt is being made here to analyze data per county or by calendar year or latitude/longtitude.

popDF.raw <- rawDF %>%
               select(EVTYPE,FATALITIES,INJURIES) %>%
               group_by(EVTYPE) %>%
               summarize(TOTAL.FATALITIES = sum(FATALITIES), TOTAL.INJURIES = sum(INJURIES)) %>%
               arrange(desc(TOTAL.FATALITIES), desc(TOTAL.INJURIES))

ecoDF.raw <- select(rawDF, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

sumFatalities <- sum(popDF.raw$TOTAL.FATALITIES)
sumInjuries <- sum(popDF.raw$TOTAL.INJURIES)

The population data is summarized and sorted to highlight an issue with data cleanliness as discussed below.

2.3 Data preparation for population impact analysis

In popDF.raw as obtained above, the EVTYPE variable is not very distinct for some events. For example, let’s look at the top 25 entries below.

print(popDF.raw, n=25)

## # A tibble: 985 × 3
##                     EVTYPE TOTAL.FATALITIES TOTAL.INJURIES
##                     <fctr>            <dbl>          <dbl>
## 1                  TORNADO             5633          91346
## 2           EXCESSIVE HEAT             1903           6525
## 3              FLASH FLOOD              978           1777
## 4                     HEAT              937           2100
## 5                LIGHTNING              816           5230
## 6                TSTM WIND              504           6957
## 7                    FLOOD              470           6789
## 8              RIP CURRENT              368            232
## 9                HIGH WIND              248           1137
## 10               AVALANCHE              224            170
## 11            WINTER STORM              206           1321
## 12            RIP CURRENTS              204            297
## 13               HEAT WAVE              172            309
## 14            EXTREME COLD              160            231
## 15       THUNDERSTORM WIND              133           1488
## 16              HEAVY SNOW              127           1021
## 17 EXTREME COLD/WIND CHILL              125             24
## 18             STRONG WIND              103            280
## 19                BLIZZARD              101            805
## 20               HIGH SURF              101            152
## 21              HEAVY RAIN               98            251
## 22            EXTREME HEAT               96            155
## 23         COLD/WIND CHILL               95             12
## 24               ICE STORM               89           1975
## 25                WILDFIRE               75            911
## # ... with 960 more rows

The NOAA documentation does not provide clarity in the difference between “HEAT” “EXCESSIVE HEAT” and “EXTREME HEAT”. Therefore, it may be more meaningful to combine some of these events as they appear near the top of our impact list. Some like “WIND” and “WINDS” seem like typos. Some others like “TSTM” and “THUNDERSTORM” mean the same thing but are inconsistently labeled.

These issues have been fixed by relabeling EVTYPE as shown below.

popDF <- popDF.raw %>%
          mutate( EVTYPE = gsub("TSTM", "THUNDERSTORM", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("WINDS", "WIND", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("^WIND$", "HIGH WIND", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("STRONG HIGH WIND", "HIGH WIND", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("STRONG WIND", "HIGH WIND", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("^HEAT$", "EXCESSIVE HEAT", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("EXTREME HEAT", "EXCESSIVE HEAT", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("RECORD/EXCESSIVE HEAT", "EXCESSIVE HEAT", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("HEAT WAVES", "HEAT WAVE", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("FLOODING", "FLOOD", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("FLOODS", "FLOOD", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("FLOOD/FLASH FLOOD", "FLASH FLOOD/FLOOD", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("CURRENTS", "CURRENT", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("DENSE FOG", "FOG", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("^HIGH SURF$", "HEAVY SURF/HIGH SURF", EVTYPE) ) %>%
          mutate( EVTYPE = gsub("^HEAVY SURF$", "HEAVY SURF/HIGH SURF", EVTYPE) ) %>%
          ## Combine and re-sort all the events again
          group_by(EVTYPE) %>%
          summarize(TOTAL.FATALITIES = sum(TOTAL.FATALITIES), TOTAL.INJURIES = sum(TOTAL.INJURIES), 
                    FPCT = TOTAL.FATALITIES/sumFatalities, IPCT = TOTAL.INJURIES/sumInjuries)

The last lines of the code re-aggregate the newly labeled data.

2.4 Data preparation for economic impact analysis

The PROPDMGEXP variable has several invalid values as shown below.

unique(ecoDF.raw["PROPDMGEXP"])

## # A tibble: 19 × 1
##    PROPDMGEXP
##        <fctr>
## 1           K
## 2           M
## 3            
## 4           B
## 5           m
## 6           +
## 7           0
## 8           5
## 9           6
## 10          ?
## 11          4
## 12          2
## 13          3
## 14          h
## 15          7
## 16          H
## 17          -
## 18          1
## 19          8

The only valid values are k, K, b, B, m or M. Invalid values like numbers etc. need to be imputed. We will arbitarily assign a “K” or a thousand dollar multiplier in such cases. This is assuming that if the economic impact would have been large, the data entry and review at NOAA would have been more thorough.

Although not explicitly shown, a similar issue exists for CROPDMGEXP, so we will impute that variable as well.

Both are resolved as shown below.

ecoDF.temp1 <- ecoDF.raw %>%
               mutate(PROPDMGEXP = gsub("[0-9?hH\\+\\-]", "k", PROPDMGEXP)) %>%
               mutate(CROPDMGEXP = gsub("[0-9?]", "k", CROPDMGEXP))

The next step would be to convert the PROPDMGEXP and CROPDMGEXP into multipliers for thousands, millions and billions

ecoDF.temp2 <- ecoDF.temp1 %>%
               mutate(PROPMULTIPLIER = ifelse(grepl("[bB]",PROPDMGEXP), 10^9, 10^3)) %>%
               mutate(CROPMULTIPLIER = ifelse(grepl("[bB]",CROPDMGEXP), 10^9, 10^3)) %>%
               mutate(PROPMULTIPLIER = ifelse(grepl("[mM]",PROPDMGEXP), 10^6, PROPMULTIPLIER)) %>%
               mutate(CROPMULTIPLIER = ifelse(grepl("[mM]",CROPDMGEXP), 10^6, CROPMULTIPLIER)) %>%
               mutate(PROPCOST = PROPMULTIPLIER*PROPDMG) %>%
               mutate(CROPCOST = CROPMULTIPLIER*CROPDMG) %>%
               mutate(COST = CROPMULTIPLIER*CROPDMG + PROPMULTIPLIER*PROPDMG)

Finally, we can now aggregate and sort the COST for the different EVTYPEs.

ecoDF <- ecoDF.temp2 %>%
          group_by(EVTYPE) %>%
          summarize(TOTAL.COST = round(sum(COST)/10^6), 
                    PROP.COST = round(sum(PROPCOST)/10^6), 
                    CROP.COST = round(sum(CROPCOST)/10^6)) %>%
          arrange(desc(TOTAL.COST))

The following step is not absolutely necessary but will help prettify the results graph and table with cost % annotations.

sumCOST <- sum(ecoDF$TOTAL.COST)

ecoDF <-  ecoDF %>%
          mutate(COSTPCT = TOTAL.COST/sumCOST) %>%
          mutate(CROPPCT = round(CROP.COST/TOTAL.COST,2))

3. Results

We are now ready to plot data and draw conclusions.

3.1 Impact of weather events on population

The popDF data has close to 1000 event types, however the impact of several recorded events is negligible given the 45 year long time period. The impact drops off precipitously as seen below, so only use the top 20 entries for FATALITIES and INJURIES in our plots.

top_FATALITIES <- arrange(popDF, desc(TOTAL.FATALITIES)) [1:20,]
top_INJURIES   <- arrange(popDF, desc(TOTAL.INJURIES)) [1:20,]

pF <- ggplot(data=top_FATALITIES, aes(x=EVTYPE,y=TOTAL.FATALITIES)) + 
      geom_bar(stat="identity", fill = "#00FF00") + 
      scale_x_discrete(limits = top_FATALITIES$EVTYPE) + 
      geom_text(aes(label = scales::percent(FPCT)), size=3, hjust="inward", color = "black") +
      coord_flip()

pI <- ggplot(data=top_INJURIES, aes(x=EVTYPE,y=TOTAL.INJURIES)) + 
      geom_bar(stat="identity", fill = "#00DDFF") + 
      scale_x_discrete(limits = top_INJURIES$EVTYPE) + 
      geom_text(aes(label = scales::percent(IPCT)), size=3, hjust="inward", color = "black") +
      coord_flip()

grid.arrange(pF, pI, ncol=2, top = textGrob("Impact Of Weather Events On Population", 
               gp = gpar(fontsize = 20, font = 3)))

The bar plot on the left indicates human fatalities and the plot on the right shows injuries.

3.1.1 Conclusions for population fatality data

By far, TORNADOES accounted for the most deaths (almost 6000), which is just a bit more than a third of the total deaths.
EXCESSIVE HEAT and FLASH FOOD have also caused thousands of deaths, accounting for another 26% of the total deaths.
The top 5 causes have led to 70% of all deaths.
In comparison, FOG at rank #20 has caused 0.5% deaths (less than 100). And the remaining 900+ events seem relatively negligible.

3.1.2 Conclusions for population injury data

Again, TORNADOES accounted for the most injuries (approximately 90,000), which is 65% of all weather-related injuries.
The top 5 causes have led to 85% of all injuries.
In comparison TROPICAL STORM at rank#20 accounted for 0.2% of all injuries. And the remaining 900+ events seem to have had a relatively negligble impact.

In summary, TORNADOES impact human population the most and the top 5 causes for deaths and injuries are the same even if differently ordered. Finally, NOAA might review the EVTYPE categorization for typos, duplicates and removing irrelevant categorizations.

3.2 Economic impact of weather events

We can plot the economic cost of the top 25 weather events in a similar manner.

top_eco <- ecoDF [1:20,]

pF <- ggplot(data=top_eco, aes(x=EVTYPE,y=TOTAL.COST)) + 
      geom_bar(stat="identity", fill = "#00FF00") + 
      scale_x_discrete(limits = top_eco$EVTYPE) + 
      labs(y = "COST in Millions") +
      labs(title = "Economic Impact Of Weather Events") + 
      geom_text(aes(label = scales::percent(COSTPCT)), size=3, hjust="inward", color = "black") +
      coord_flip()

pF

3.2.1 Conclusions for population fatality data

By far, the economic impact of FLOOD is the highest at approximately $150 billion, which is almost a third of the total damage caused.
The next 4 top causes (HURRICANE/TYPHOON, TORNADO and STORM SURGE) are also significant, together accounting for another 40-65 billion dollars damage.
The top 5 have caused about 70% of the damage.
HEAVY RAIN/SEVERE WEATHER at rank#20 has caused 1% cost damage, and the remaining 900+ events seem relatively negligible.

3.2.2 Crop v/s Property damage conclusions

We can take a quick peek at the top_eco table where the crop damage percentage is shown int he last column.

print(arrange(top_eco[,c(1:2,6)],desc(CROPPCT)),n=10)

## # A tibble: 20 × 3
##               EVTYPE TOTAL.COST CROPPCT
##               <fctr>      <dbl>   <dbl>
## 1            DROUGHT      15019    0.93
## 2          ICE STORM       8967    0.56
## 3        RIVER FLOOD      10148    0.50
## 4          HURRICANE      14610    0.19
## 5               HAIL      18759    0.16
## 6          HIGH WIND       5909    0.11
## 7          TSTM WIND       5039    0.11
## 8  THUNDERSTORM WIND       3898    0.11
## 9        FLASH FLOOD      17563    0.08
## 10    TROPICAL STORM       8382    0.08
## # ... with 10 more rows

The table above shows that damage to property cost dominates in most cases. The notable deviation, not surprisingly, is DROUGHT where crop damage accounts for more than 90% of the overall damage.

This concludes our analysis of the NOAA database.