Human and financial cost of adverse weather events in the US; 1950 - 2011

Synopsis

  • This report contains an analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which tracks major weather events and consquential human and economic cost.
  • The dataset is large, consisting of over 900,000 storm records over the last 60 years or so.
  • Recording levels rapidly increased from the mid 1990s, and coding of storm events is highly variable.
  • Once the dataset had been cleaned and recoded, it is evident that thunderstorm winds are the commonest storm event.
  • However, tornadoes are responsible for the largest number of injuries (since 1995), and thuderstorm winds, the greatest cost.
  • Reexamining the data as cost per 1000 events, and injury per 1000 events revealed that tsunamis and volcanic ash had higher per event costs, and heat and avalanche had high injury rates

Data processing

In this section of the report I describe:

  • Reading in the data
  • Cleaning the data for exploratory analysis
  • Exploratory analysis
  • Final analysis

Step 1. Download and read in the data

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "storm.csv.bz2")
storm <- read.csv("storm.csv.bz2")

Examining the structure of the data shows it to be a large dataset with nearly 1m observations and 37 variables.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The variables of particular interest are likely to be date of events (BGN_DATE), the type of event (EVTYPE), and the relevant damage variables (FATALITIES, INJURIES, PROPDMG, CROPDMG).

Step 2. Reduce the number of variables in the data frame and convert the date field

For much of the data management I used the dplyr package, and lubridate for data conversion

library(dplyr)
library(lubridate)

## Convert column names to lower case for ease of use

colnames(storm) <- tolower(colnames(storm))

##Select variables of interest

storm1 <- select(storm, bgn_date, evtype, fatalities, injuries, propdmg, cropdmg)

## Create date variable

storm1$newdate <- mdy_hms(as.character(storm1$bgn_date))

head(storm1)
##             bgn_date  evtype fatalities injuries propdmg cropdmg
## 1  4/18/1950 0:00:00 TORNADO          0       15    25.0       0
## 2  4/18/1950 0:00:00 TORNADO          0        0     2.5       0
## 3  2/20/1951 0:00:00 TORNADO          0        2    25.0       0
## 4   6/8/1951 0:00:00 TORNADO          0        2     2.5       0
## 5 11/15/1951 0:00:00 TORNADO          0        2     2.5       0
## 6 11/15/1951 0:00:00 TORNADO          0        6     2.5       0
##      newdate
## 1 1950-04-18
## 2 1950-04-18
## 3 1951-02-20
## 4 1951-06-08
## 5 1951-11-15
## 6 1951-11-15
## Plot the data over time

par(mfrow = c(1,3))

with(storm1, plot(table(year(newdate)), type = 'l', xlab="Year", ylab="Report frequency"))

with(storm1, plot(table(month(newdate, label = TRUE)), type = 'l', xlab="Month", ylab="Report frequency"))

with(storm1, barplot(table(wday(newdate, label = TRUE, abbr = TRUE)),  xlab="Day of week", ylab="Report frequency"))

We can see event reporting has increased over the years with a rapid increase from the mid 1990s. Events are most likely to be reported in the summmer months and during the week.

Step 3. Examine and clean the event type variable

We can see that evtype is factor variable with 985 levels, whereas the codebook suggests that there are 48 storm codes: -

Astronomical Low Tide Z
Avalanche Z
Blizzard Z
Coastal Flood Z
Cold/Wind Chill Z
Debris Flow C
Dense Fog Z
Dense Smoke Z
Drought Z
Dust Devil C
Dust Storm Z
Excessive Heat Z
Extreme Cold/Wind Chill Z
Flash Flood C
Flood C
Frost/Freeze Z
Funnel Cloud C
Freezing Fog Z
Hail C
Heat Z
Heavy Rain C
Heavy Snow Z
High Surf Z
High Wind Z
Hurricane (Typhoon) Z
Ice Storm Z
Lake-Effect Snow Z
Lakeshore Flood Z
Lightning C
Marine Hail M
Marine High Wind M
Marine Strong Wind M
Marine Thunderstorm Wind M
Rip Current Z
Seiche Z
Sleet Z
Storm Surge/Tide Z
Strong Wind Z
Thunderstorm Wind C
Tornado C
Tropical Depression Z
Tropical Storm Z
Tsunami Z
Volcanic Ash Z
Waterspout M
Wildfire Z
Winter Storm Z
Winter Weather Z

We need to examine the evtype variable more closely.

## Convert the storm codes to a look up table
lu<- c('Astronomical Low Tide',
       'Avalanche',
       'Blizzard',
       'Coastal Flood',
       'Cold/Wind Chill',
       'Debris Flow',
       'Dense Fog',
       'Dense Smoke',
       'Drought',
       'Dust Devil',
       'Dust Storm',
       'Excessive Heat',
       'Extreme Cold/Wind Chill',
       'Flash Flood',
       'Flood',
       'Frost/Freeze',
       'Funnel Cloud',
       'Freezing Fog',
       'Hail',
       'Heat',
       'Heavy Rain',
       'Heavy Snow',
       'High Surf',
       'High Wind',
       'Hurricane (Typhoon)',
       'Ice Storm',
       'Lake-Effect Snow',
       'Lakeshore Flood',
       'Lightning',
       'Marine Hail',
       'Marine High Wind',
       'Marine Strong Wind',
       'Marine Thunderstorm Wind',
       'Rip Current',
       'Seiche',
       'Sleet',
       'Storm Surge/Tide',
       'Strong Wind',
       'Thunderstorm Wind',
       'Tornado',
       'Tropical Depression',
       'Tropical Storm',
       'Tsunami',
       'Volcanic Ash', 
       'Waterspout',
       'Wildfire',
       'Winter Storm',
       'Winter Weather')

## Convert to lower case

lu<- tolower(lu)

head(table(storm1$evtype), 20); tail(table(storm1$evtype), 20)
## 
##     HIGH SURF ADVISORY          COASTAL FLOOD            FLASH FLOOD 
##                      1                      1                      1 
##              LIGHTNING              TSTM WIND        TSTM WIND (G45) 
##                      1                      4                      1 
##             WATERSPOUT                   WIND                      ? 
##                      1                      1                      1 
##        ABNORMAL WARMTH         ABNORMALLY DRY         ABNORMALLY WET 
##                      4                      2                      1 
##   ACCUMULATED SNOWFALL    AGRICULTURAL FREEZE          APACHE COUNTY 
##                      4                      6                      1 
## ASTRONOMICAL HIGH TIDE  ASTRONOMICAL LOW TIDE               AVALANCE 
##                    103                    174                      1 
##              AVALANCHE           BEACH EROSIN 
##                    386                      1
## 
##             WIND DAMAGE              WIND GUSTS              WIND STORM 
##                      27                       3                       1 
##               WIND/HAIL                   WINDS              WINTER MIX 
##                       1                      36                       3 
##            WINTER STORM WINTER STORM HIGH WINDS  WINTER STORM/HIGH WIND 
##                   11433                       1                       1 
## WINTER STORM/HIGH WINDS           WINTER STORMS          Winter Weather 
##                       1                       3                      19 
##          WINTER WEATHER      WINTER WEATHER MIX      WINTER WEATHER/MIX 
##                    7026                       6                    1104 
##             WINTERY MIX              Wintry mix              Wintry Mix 
##                       2                       3                       1 
##              WINTRY MIX                     WND 
##                      90                       1

We can see that storm coding is highly inconsistent with variations in case, spelling, code combinations and so on.

## Create a new variable to see what proportion of storm data is correctly coded
storm1$evtype1 <- tolower(storm1$evtype)
storm1$evnew <- ifelse(storm1$evtype1 %in% lu, "correct", "incorrect")

with(storm1, table(evnew, year(newdate)))
##            
## evnew        1950  1951  1952  1953  1954  1955  1956  1957  1958  1959
##   correct     223   269   272   492   609   992   968  1409  1314  1161
##   incorrect     0     0     0     0     0   421   735   775   899   652
##            
## evnew        1960  1961  1962  1963  1964  1965  1966  1967  1968  1969
##   correct    1226  1494  1559  1145  1439  1800  1338  1730  1783  1416
##   incorrect   719   752   830   823   909  1055  1050   958  1529  1510
##            
## evnew        1970  1971  1972  1973  1974  1975  1976  1977  1978  1979
##   correct    1421  1927  1456  2297  2783  2336  2026  2005  1899  2233
##   incorrect  1794  1544   712  2166  2603  2639  1742  1723  1758  2046
##            
## evnew        1980  1981  1982  1983  1984  1985  1986  1987  1988  1989
##   correct    2965  2324  3562  3329  3769  4152  4361  3111  3310  4699
##   incorrect  3181  2193  3570  4993  3566  3827  4365  4256  3947  5711
##            
## evnew        1990  1991  1992  1993  1994  1995  1996  1997  1998  1999
##   correct    4882  6019  7091  7827 11684 14617 20934 17356 22608 19221
##   incorrect  6064  6503  6443  4780  8947 13353 11336 11324 15520 12068
##            
## evnew        2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
##   correct   20399 21382 21791 25774 25613 25043 29739 43217 55606 45771
##   incorrect 14072 13580 14502 13978 13750 14141 14295    72    57    46
##            
## evnew        2010  2011
##   correct   48093 62080
##   incorrect    68    94

This shows that 29.6% of events are not coded as per the codebook, although it is greatly improved since 2007. We can’t ignore this so will need to recode the evtype variable. There are 0 NAs in the dataset.

Step 4. Recoding event types

There are 852 evtype variants to be recoded.

## Exclude correctly coded variables, group by `evtype`, count frequency and sort by frequency

storm1 %>% filter(evnew == 0) %>% group_by(evtype1) %>% summarise(freq = length(evtype1)) %>% arrange(-freq)
## Source: local data frame [0 x 2]
## 
## Variables not shown: evtype1 (chr), freq (lgl)
## Recode most frequent coding errors
storm1[grep("^tstm", storm1$evtype1),]$evtype1 <- "thunderstorm wind"
storm1[grep("^thun", storm1$evtype1),]$evtype1 <- "thunderstorm wind"
storm1[grep("^marine ts", storm1$evtype1),]$evtype1 <- "marine thunderstorm wind"
storm1[grep("^hail", storm1$evtype1),]$evtype1 <- "hail"
storm1[grep("^winter w", storm1$evtype1),]$evtype1 <- "winter weather"
storm1[grep("^winter s", storm1$evtype1),]$evtype1 <- "winter storm"
storm1[grep("^tornado", storm1$evtype1),]$evtype1 <- "tornado"
storm1[grep("^lakes", storm1$evtype1),]$evtype1 <- "lakeshore flood"
storm1[grep("^wild", storm1$evtype1),]$evtype1 <- "wildfire"
storm1[grep("^waters", storm1$evtype1),]$evtype1 <- "waterspout"
storm1[grep("fld$", storm1$evtype1),]$evtype1 <- "flood"


## Review results
storm1 %>% group_by(evtype1) %>% summarise(freq = length(evtype1)) %>% arrange(-freq)
## Source: local data frame [737 x 2]
## 
##                     evtype1   freq
## 1         thunderstorm wind 324774
## 2                      hail 288776
## 3                   tornado  60686
## 4               flash flood  54277
## 5                     flood  28721
## 6                 high wind  20214
## 7                 lightning  15754
## 8                heavy snow  15708
## 9  marine thunderstorm wind  11987
## 10               heavy rain  11742
## ..                      ...    ...
## and recheck proportion of storm codes recorded as per code book
storm1$evnew <- ifelse(storm1$evtype1 %in% lu, 'correct', 'incorrect')

with(storm1, table(evnew, year(newdate)))
##            
## evnew        1950  1951  1952  1953  1954  1955  1956  1957  1958  1959
##   correct     223   269   272   492   609  1413  1703  2184  2213  1813
##   incorrect     0     0     0     0     0     0     0     0     0     0
##            
## evnew        1960  1961  1962  1963  1964  1965  1966  1967  1968  1969
##   correct    1945  2246  2389  1968  2348  2855  2388  2688  3312  2926
##   incorrect     0     0     0     0     0     0     0     0     0     0
##            
## evnew        1970  1971  1972  1973  1974  1975  1976  1977  1978  1979
##   correct    3215  3471  2168  4463  5386  4975  3768  3728  3657  4279
##   incorrect     0     0     0     0     0     0     0     0     0     0
##            
## evnew        1980  1981  1982  1983  1984  1985  1986  1987  1988  1989
##   correct    6146  4517  7132  8322  7335  7979  8726  7367  7257 10410
##   incorrect     0     0     0     0     0     0     0     0     0     0
##            
## evnew        1990  1991  1992  1993  1994  1995  1996  1997  1998  1999
##   correct   10946 12522 13534 11606 18950 25281 31458 27867 37209 30457
##   incorrect     0     0     0  1001  1681  2689   812   813   919   832
##            
## evnew        2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
##   correct   33477 34115 35674 39545 39132 38875 43955 43217 55606 45771
##   incorrect   994   847   619   207   231   309    79    72    57    46
##            
## evnew        2010  2011
##   correct   48093 62080
##   incorrect    68    94

This cleaning exerise reduced the miscoding rate to 1.4% which we will accept given time constraints on analysis.

Step 5. Check injury and damage variables

## Create summaries of the the injury and damage variables

apply(storm1[,3:6], 2, summary)
##         fatalities  injuries propdmg cropdmg
## Min.       0.00000    0.0000    0.00   0.000
## 1st Qu.    0.00000    0.0000    0.00   0.000
## Median     0.00000    0.0000    0.00   0.000
## Mean       0.01678    0.1557   12.06   1.527
## 3rd Qu.    0.00000    0.0000    0.50   0.000
## Max.     583.00000 1700.0000 5000.00 990.000

Step 6. Create injury and cost variables

For this exercise I have looked at overall costs of damage, fatalities, and total injury, using the correctly coded dataset, and calculated a series of new fields:

  • Total injury
  • Total cost
  • Injury per 1000 events
  • Cost per 1000 events
  • Fatality rate per 1000 events
## filter the storm data frame to exclude miscoded events; create injury and cost variables
storm2 <- storm1 %>% filter(evnew == "correct") %>% mutate(injury = fatalities + injuries, cost = propdmg + cropdmg)

## create data summaries by event and year of total injury, total cost, injury rate, fatality rate, cost per 1000 events
storm3 <- storm2 %>% group_by(evtype1, year = year(newdate)) %>% summarise( n = length(injury), tot_inj = sum(injury), tot_cost = sum(cost), rate_inj = round(tot_inj/n * 1000,2), fatality_rate = round(sum(fatalities)/n *1000,2), cost_per1000 = round(tot_cost/n *1000,2)) %>% arrange(year, -fatality_rate) 

## Because coding levels were low prior to 1995 I have decided to use data from 1995 onwards
storm3 <- storm3 %>% filter(year > 1994)

## See which event type is most frequent, has the greatest number of injuries, and highest costs
counts <- with(storm3, tapply(n, evtype1, sum))  
which.max(counts) 
## thunderstorm wind 
##                37
injury <- with(storm3, tapply(tot_inj, evtype1, sum))
which.max(injury)
## tornado 
##      38
costs <- with(storm3, tapply(tot_cost, evtype1, sum))
which.max(costs)
## thunderstorm wind 
##                37

Results

Thunderstorm winds are the most frequently reported events since 1995, however tornados are responsible for the largest number of injuries with thunderstorm winds for the greatest cost. If we calculate the injury rate per 1000 events, summarised in Figure 1, we can see that thunderstorm winds have lower injury rates than avalanches and heat, and lower than tornadoes. In general there have been declines in injury rates per 1000 reported events over the last 15 years.

Thunderstorm winds are responsible for the greatest absolute costs, but although much less frequent, volcanic ash and tsunamis incur greater cost per event. This is shown in Figure 2.

If there were more time, further breakdown by type of cost, trend and type of injury would be undertaken.

require(ggplot2)

g<- ggplot(data = storm3, aes(year, log10(rate_inj)))
g + geom_line() + facet_wrap(~evtype1) + theme(axis.text=element_text(size=6),
        axis.title=element_text(size=10,face="bold"))+ 
      theme(strip.text.x = element_text(size = 6, colour = "blue")) +geom_smooth() + ggtitle("Fig 1. Storm related injuries per 1000 reported events; 1995-2011")

require(lattice)
bwplot(evtype1~log10(cost_per1000), data = storm3, main = "Fig2. Boxplot of costs per 1000 storm events, 1995-2011")