Assessment of US Storm Hazards - the Most Damaging Events

Data Processing

Reading and checking the data

Reading data

file <- "StormData.csv.bz2"
data <- read.csv(file)

Checking Event Type Column

str(data$EVTYPE)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

985 is too much. Official list contains only 48 event types. Following offical event list is taken from the Storm Data Preparation directive, p.6:

events <- read.table("event_list_raw.txt", sep=";", col.names = "Event.Type")
sapply(events, function(x) substr(x, 1, regexpr(" [A-Z]$", x) - 1))

##       Event.Type                
##  [1,] "Astronomical Low Tide"   
##  [2,] "Avalanche"               
##  [3,] "Blizzard"                
##  [4,] "Coastal Flood"           
##  [5,] "Cold/Wind Chill"         
##  [6,] "Debris Flow"             
##  [7,] "Dense Fog"               
##  [8,] "Dense Smoke"             
##  [9,] "Drought"                 
## [10,] "Dust Devil"              
## [11,] "Dust Storm"              
## [12,] "Excessive Heat"          
## [13,] "Extreme Cold/Wind Chill" 
## [14,] "Flash Flood"             
## [15,] "Flood"                   
## [16,] "Frost/Freeze"            
## [17,] "Funnel Cloud"            
## [18,] "Freezing Fog"            
## [19,] "Hail"                    
## [20,] "Heat"                    
## [21,] "Heavy Rain"              
## [22,] "Heavy Snow"              
## [23,] "High Surf"               
## [24,] "High Wind"               
## [25,] "Hurricane (Typhoon)"     
## [26,] "Ice Storm"               
## [27,] "Lake-Effect Snow"        
## [28,] "Lakeshore Flood"         
## [29,] "Lightning"               
## [30,] "Marine Hail"             
## [31,] "Marine High Wind"        
## [32,] "Marine Strong Wind"      
## [33,] "Marine Thunderstorm Wind"
## [34,] "Rip Current"             
## [35,] "Seiche"                  
## [36,] "Sleet"                   
## [37,] "Storm Surge/Tide"        
## [38,] "Strong Wind"             
## [39,] "Thunderstorm Wind"       
## [40,] "Tornado"                 
## [41,] "Tropical Depression"     
## [42,] "Tropical Storm"          
## [43,] "Tsunami"                 
## [44,] "Volcanic Ash"            
## [45,] "Waterspout"              
## [46,] "Wildfire"                
## [47,] "Winter Storm"            
## [48,] "Winter Weather"

Event list should be standardized to conform to the offical list before one can decide which event types make the biggest impact. For this end I don’t apply any automatic string transformation. Instead I use a manual method:

selecting the EVTYPE’s of the biggest impact on health or economy
standardizing them
recalculating impact for these standardized event types

To be sure this approach allows corret identification of the most deterimental event types I will compare the resulting ordering of event types with the impact of the left out, unstandardize EVTYPE’s. We will check if standardizing these other EVTYPE’s might change the ordering at the top of the even type list.

To apply this method I need to first check and clean the numerical measures of damage.

Checking Exponents

The numerical values describing damage are also doubtful in many cases. To correct the errors I check the top highest numbers as these have the biggest impact on the ordering of event types.

1. Checking PROPDMGEXP

Let’s print and review the exponent values.

unique(data$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

We see the some checking and cleaning should be done:

check the biggest exponents i.e. B,8,7
change letters into numbers
change “ ” (space) into 0
eliminate characters like -,+,?

Checking PROPDMGEXP == B:

table(data[data$PROPDMGEXP=='B',]$PROPDMG)

## 
##   0.1     1  1.04   1.2   1.3   1.5   1.6   1.7   1.8     2  2.09   2.1 
##     1     6     1     1     1     4     1     1     1     1     1     1 
##   2.5   2.8     3     4  4.83     5  5.15  5.42  5.88  7.35    10 11.26 
##     2     1     2     3     1     2     1     1     1     1     1     1 
## 16.93  31.3   115 
##     1     1     1

Let’s check the extreme PROPDMG=115 case (115*10^9, the biggest value):

data[data$PROPDMGEXP=='B' & data$PROPDMG==115,]

##        STATE__         BGN_DATE    BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953       6 1/1/2006 0:00:00 12:00:00 AM       PST     55       NAPA
##        STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI         END_DATE
## 605953    CA  FLOOD         0         COUNTYWIDE 1/1/2006 0:00:00
##           END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM          0         NA         0         COUNTYWIDE
##        LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953      0     0 NA   0          0        0     115          B    32.5
##        CROPDMGEXP WFO          STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953          M MTR CALIFORNIA, Western               3828     12218
##        LATITUDE_E LONGITUDE_
## 605953       3828      12218
##                                                                                                                                                                                                                                                                                                                                                                                               REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
##        REFNUM
## 605953 605943

It looks like it’s an error and should be M instead of B. I checked other top values for exponents B, 8, 7, 6 and they look fine.

2. Checking CROPDMGEXP

unique(data$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

I checked several of the highest values here and they seem OK.

Checking values of property and crop damages

1. Checking PROPDMG

summary(data$PROPDMG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00

Values >= 1000 with exponent attached looks suspicious. Let’s check them:

cond <- data$PROPDMG>=1000
dc <- data[cond,]
table(dc$PROPDMG)

## 
## 1000 1584 3000 3200 3500 4410 4800 5000 
##   10    1    3    1    1    1    1    4

dc <- dc[order(dc$PROPDMG,decreasing=TRUE),]
dc[1,]

##        STATE__          BGN_DATE    BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 778568      37 7/26/2009 0:00:00 08:11:00 PM       EST     69   FRANKLIN
##        STATE            EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI
## 778568    NC THUNDERSTORM WIND         1     WNW  STALLINGS
##                 END_DATE    END_TIME COUNTY_END COUNTYENDN END_RANGE
## 778568 7/26/2009 0:00:00 08:13:00 PM          0         NA         0
##        END_AZI END_LOCATI LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG
## 778568                         0     0 NA  50          0        0    5000
##        PROPDMGEXP CROPDMG CROPDMGEXP WFO              STATEOFFIC
## 778568          K       0          K RAH NORTH CAROLINA, Central
##                                                                                                                                                                                                       ZONENAMES
## 778568                                                                                                                                                                                                         
##        LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 778568     3604      7810          0          0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   REMARKS
## 778568 EPISODE NARRATIVE: Showers and thunderstorms developed in the afternoon and evening along a surface trough across central North Carolina, in a warm and moist atmosphere. A mid level shear axis crossed the area during the evening helping to intensive the thunderstorms. Some of the thunderstorms during the evening became severe and produced several reports of damaging winds.EVENT NARRATIVE: Power-lines were reported down on North Carolina Highway 56. Numerous power outages were reported across eastern portions of Franklin County as well. One tree was blown onto a house on Edward Best Road.
##        REFNUM
## 778568 778558

Checking all 22 records with PROPDMG>=1000 shows that all have K exponent. I guess it’s an error and there should either be 5*10^3 or 5000*10^0. This should be corrrected.

2. Checking CROPDMG

summary(data$CROPDMG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000

There are no values above 1000. I checked some top values and they look fine.

3. Other checks

Other checks show that there are plenty of cases where DMGEXP>0 but no DMG value is provided, both for PROP and CROP. I assume the DMG in such cases should be corrected to 1.

Correcting the data

Correcting numerical values

1. Replacing B exponent with M in one case

d1 <- subset(data, select = c(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
d1[d1$PROPDMGEXP=='B' & d1$PROPDMG==115,]$PROPDMGEXP <- 'M'

2. Setting exponent=0 for PROPDMG >= 1000

d1[d1$PROPDMG >= 1000,]$PROPDMGEXP <- 0

3. Setting exponents==’’ to 0

d1[d1$PROPDMGEXP=='',]$PROPDMGEXP <- 0
d1[d1$CROPDMGEXP=='',]$CROPDMGEXP <- 0

4. Replacing letters in exponents with numbers

library(plyr)
exponent_fix <- function (x) {
    mapvalues(x,
              from=c('+','-','?','h','H','k','K','m','M','b','B'),
              to = c(NA, NA, NA, '2','2','3','3','6','6','9','9')
    )
}
d1$PROPDMGEXP = exponent_fix(d1$PROPDMGEXP)
d1$CROPDMGEXP = exponent_fix(d1$CROPDMGEXP)

5. Eliminating observations with missing values

Some of these observations might be partially complete but I accept this. There are 21 observations to be removed.

d1 <- na.omit(d1)

6. Changing exponents from factors to numbers

d1$PROPDMGEXP <- as.numeric(levels(d1$PROPDMGEXP)[d1$PROPDMGEXP])
d1$CROPDMGEXP <- as.numeric(levels(d1$CROPDMGEXP)[d1$CROPDMGEXP])

7. Setting DMG=1 where DMG==0 but DMGEXP>0.

There are 197250 such cases for PROP and 261774 for CROP

d1[d1$PROPDMGEXP!=0 & d1$PROPDMG==0,]$PROPDMG <- 1
d1[d1$CROPDMGEXP!=0 & d1$CROPDMG==0,]$CROPDMG <- 1

8. Calculating numerical value of the damages

d1$PROPDMGVAL <- d1$PROPDMG*10^d1$PROPDMGEXP
d1$CROPDMGVAL <- d1$CROPDMG*10^d1$CROPDMGEXP

Standardizing the Event Type

Once we have the numerical values for damage calculate we can finally standardize the Event Type variable.

1. Ordering the EVTYPE categories by their impacts.

d_ft <- aggregate(FATALITIES~EVTYPE, data=d1, sum)
d_in <- aggregate(INJURIES~EVTYPE, data=d1, sum)
d_pr <- aggregate(PROPDMGVAL~EVTYPE, data=d1, sum)
d_cr <- aggregate(CROPDMGVAL~EVTYPE, data=d1, sum)

dft <- d_ft[order(d_ft$FATALITIES, decreasing=TRUE),]
din <- d_in[order(d_in$INJURIES, decreasing=TRUE),]
dpr <- d_pr[order(d_pr$PROPDMGVAL, decreasing=TRUE),]
dcr <- d_cr[order(d_cr$CROPDMGVAL, decreasing=TRUE),]

dft$EVTYPE <- as.character(dft$EVTYPE)
din$EVTYPE <- as.character(din$EVTYPE)
dpr$EVTYPE <- as.character(dpr$EVTYPE)
dcr$EVTYPE <- as.character(dcr$EVTYPE)

2. Selecting the top EVTYPE categories to be standardized.

Plotting EVTYPE’s impact on health (FATALITIES, INJURIES) and economy (Property Damage, Crop Damage).

plot_damage <- function(df,n=nrow(df)) {
    all_other <- if(n < nrow(df)) sum(df[(n+1):nrow(df),2]) else 0
    ymax = max(df[1,2], all_other)
    plot(df[1:n,2], main=names(df)[2], ylab="Damage", xlab="Event type", ylim=c(0, ymax))
    if(all_other > 0) abline(h=all_other)
}
par(mfrow=c(2,2))
n = 40
plot_damage(dft, n)
plot_damage(din, n)
plot_damage(dpr, n)
plot_damage(dcr, n)

The plot shows the damage for top 40 EVTYPE’s for each damage indicator. The horizontal line shows combined damage of all the other EVTYPE’s. It seems that it’s enough to standardize the top 40 EVTYPE’s to account for the major event types. The comined damage of the other EVTYPE’s shouldn’t change the top event types ordering.

Here are all the relevant events:

selected_events = sort(unique(c(
    dft[1:n,1],
    din[1:n,1],
    dpr[1:n,1],
    dcr[1:n,1])))
selected_events

##  [1] "AGRICULTURAL FREEZE"        "AVALANCHE"                 
##  [3] "BLIZZARD"                   "COASTAL FLOOD"             
##  [5] "COASTAL FLOODING"           "COLD"                      
##  [7] "COLD AND WET CONDITIONS"    "COLD/WIND CHILL"           
##  [9] "Damaging Freeze"            "DAMAGING FREEZE"           
## [11] "DENSE FOG"                  "DROUGHT"                   
## [13] "DUST STORM"                 "Early Frost"               
## [15] "EXCESSIVE HEAT"             "EXCESSIVE WETNESS"         
## [17] "EXTREME COLD"               "EXTREME COLD/WIND CHILL"   
## [19] "EXTREME HEAT"               "FLASH FLOOD"               
## [21] "FLASH FLOOD/FLOOD"          "FLASH FLOODING"            
## [23] "FLOOD"                      "FLOOD/FLASH FLOOD"         
## [25] "FLOOD/RAIN/WINDS"           "FOG"                       
## [27] "FREEZE"                     "FROST"                     
## [29] "FROST/FREEZE"               "GLAZE"                     
## [31] "HAIL"                       "HAILSTORM"                 
## [33] "HEAT"                       "HEAT WAVE"                 
## [35] "HEAVY RAIN"                 "HEAVY RAIN/SEVERE WEATHER" 
## [37] "HEAVY RAINS"                "HEAVY SNOW"                
## [39] "HEAVY SURF/HIGH SURF"       "HIGH SURF"                 
## [41] "HIGH WIND"                  "HIGH WINDS"                
## [43] "HURRICANE"                  "HURRICANE ERIN"            
## [45] "HURRICANE OPAL"             "HURRICANE/TYPHOON"         
## [47] "ICE"                        "ICE STORM"                 
## [49] "LANDSLIDE"                  "LIGHTNING"                 
## [51] "RIP CURRENT"                "RIP CURRENTS"              
## [53] "RIVER FLOOD"                "SEVERE THUNDERSTORM"       
## [55] "SEVERE THUNDERSTORM WINDS"  "STORM SURGE"               
## [57] "STORM SURGE/TIDE"           "STRONG WIND"               
## [59] "THUNDERSTORM WIND"          "THUNDERSTORM WINDS"        
## [61] "TORNADO"                    "TORNADOES, TSTM WIND, HAIL"
## [63] "TROPICAL STORM"             "TSTM WIND"                 
## [65] "TSTM WIND/HAIL"             "TSUNAMI"                   
## [67] "TYPHOON"                    "UNSEASONABLY WARM AND DRY" 
## [69] "URBAN/SML STREAM FLD"       "WILD FIRES"                
## [71] "WILD/FOREST FIRE"           "WILDFIRE"                  
## [73] "WIND"                       "WINTER STORM"              
## [75] "WINTER WEATHER"             "WINTER WEATHER/MIX"

There are 76 of them. We write them into file and create a dictionary where each EVTYPE on a list have a standardized Event Type. Additionally we add “Other” event type to simplify subsequent calculations.

3. Creating EVTYPE list, standardizing it and creating dictionary from the former to the latter

I wrote the 76 EVTYPE’s into a file and created a dictionary by assigining each of these EVTYPE’s a standardized name.

write.table(selected_events, file="dict_keys.txt", row.names=F, col.names=F)
write.table(events_clean, file="events_clean.txt", row.names=F, col.names=F)
dict = read.table(file="dict.txt", stringsAsFactors = F, col.names=c("EVTYPE","EVENT_TYPE"))
dict

##                        EVTYPE              EVENT_TYPE
## 1         AGRICULTURAL FREEZE            Frost/Freeze
## 2                   AVALANCHE               Avalanche
## 3                    BLIZZARD                Blizzard
## 4               COASTAL FLOOD           Coastal Flood
## 5            COASTAL FLOODING           Coastal Flood
## 6                        COLD         Cold/Wind Chill
## 7     COLD AND WET CONDITIONS         Cold/Wind Chill
## 8             COLD/WIND CHILL         Cold/Wind Chill
## 9             Damaging Freeze            Frost/Freeze
## 10            DAMAGING FREEZE            Frost/Freeze
## 11                  DENSE FOG               Dense Fog
## 12                    DROUGHT                 Drought
## 13                 DUST STORM              Dust Storm
## 14                Early Frost            Frost/Freeze
## 15             EXCESSIVE HEAT          Excessive Heat
## 16          EXCESSIVE WETNESS       Excessive Wetness
## 17               EXTREME COLD Extreme Cold/Wind Chill
## 18    EXTREME COLD/WIND CHILL Extreme Cold/Wind Chill
## 19               EXTREME HEAT          Excessive Heat
## 20                FLASH FLOOD             Flash Flood
## 21          FLASH FLOOD/FLOOD             Flash Flood
## 22             FLASH FLOODING             Flash Flood
## 23                      FLOOD                   Flood
## 24          FLOOD/FLASH FLOOD             Flash Flood
## 25           FLOOD/RAIN/WINDS                   Flood
## 26                        FOG               Dense Fog
## 27                      FROST            Frost/Freeze
## 28                     FREEZE            Frost/Freeze
## 29               FROST/FREEZE            Frost/Freeze
## 30                      GLAZE               Ice Storm
## 31                       HAIL                    Hail
## 32                  HAILSTORM                    Hail
## 33                       HEAT                    Heat
## 34                  HEAT WAVE                    Heat
## 35                 HEAVY RAIN              Heavy Rain
## 36  HEAVY RAIN/SEVERE WEATHER              Heavy Rain
## 37                HEAVY RAINS              Heavy Rain
## 38                 HEAVY SNOW              Heavy Snow
## 39       HEAVY SURF/HIGH SURF               High Surf
## 40                  HIGH SURF               High Surf
## 41                  HIGH WIND               High Wind
## 42                 HIGH WINDS               High Wind
## 43                  HURRICANE     Hurricane (Typhoon)
## 44             HURRICANE ERIN     Hurricane (Typhoon)
## 45             HURRICANE OPAL     Hurricane (Typhoon)
## 46          HURRICANE/TYPHOON     Hurricane (Typhoon)
## 47                        ICE               Ice Storm
## 48                  ICE STORM               Ice Storm
## 49                  LANDSLIDE               Landslide
## 50                  LIGHTNING               Lightning
## 51                RIP CURRENT             Rip Current
## 52               RIP CURRENTS             Rip Current
## 53                RIVER FLOOD                   Flood
## 54        SEVERE THUNDERSTORM       Thunderstorm Wind
## 55  SEVERE THUNDERSTORM WINDS       Thunderstorm Wind
## 56                STORM SURGE        Storm Surge/Tide
## 57           STORM SURGE/TIDE        Storm Surge/Tide
## 58                STRONG WIND               High Wind
## 59          THUNDERSTORM WIND       Thunderstorm Wind
## 60         THUNDERSTORM WINDS       Thunderstorm Wind
## 61                    TORNADO                 Tornado
## 62 TORNADOES, TSTM WIND, HAIL                 Tornado
## 63             TROPICAL STORM          Tropical Storm
## 64                  TSTM WIND       Thunderstorm Wind
## 65             TSTM WIND/HAIL       Thunderstorm Wind
## 66                    TSUNAMI                 Tsunami
## 67                    TYPHOON     Hurricane (Typhoon)
## 68  UNSEASONABLY WARM AND DRY                    Heat
## 69       URBAN/SML STREAM FLD             Flash Flood
## 70                 WILD FIRES                Wildfire
## 71           WILD/FOREST FIRE                Wildfire
## 72                   WILDFIRE                Wildfire
## 73                       WIND               High Wind
## 74               WINTER STORM            Winter Storm
## 75             WINTER WEATHER          Winter Weather
## 76         WINTER WEATHER/MIX          Winter Weather
## 77                      other                   other

Confirming that the dictionary contains all the selected EVTYPE’s:

all(sapply(selected_events, function(x) x %in% dict$EVTYPE))

## [1] TRUE

4. Standardizing event types in the data

EVENT_TYPE column contains the standardized name for each EVTYPE in the data.

standardize_event_types <- function(df, n = nrow(df)) {
    for(i in 1:nrow(df)) {
        if(!any(dict$EVTYPE==df[i,"EVTYPE"])) {
            df[i,"EVTYPE"] = "other"
        }
    }
    transform(df[1:n,], EVENT_TYPE = mapvalues(df$EVTYPE, dict$EVTYPE, dict$EVENT_TYPE))
}
dftst <- standardize_event_types(dft)
dinst <- standardize_event_types(din)
dprst <- standardize_event_types(dpr)
dcrst <- standardize_event_types(dcr)

Final data aggregating

Now I aggregate event data and sum up health and economic damages. Event types are sorted by their impact. The “Other” categories are moved to the last rows of the data frames.

dftf <- aggregate(FATALITIES~EVENT_TYPE, data=dftst, sum)
dinf <- aggregate(INJURIES~EVENT_TYPE, data=dinst, sum)
dprf <- aggregate(PROPDMGVAL~EVENT_TYPE, data=dprst, sum)
dcrf <- aggregate(CROPDMGVAL~EVENT_TYPE, data=dcrst, sum)
# final sorting
dftfs <- dftf[order(dftf$FATALITIES, decreasing=TRUE),]
dinfs <- dinf[order(dinf$INJURIES, decreasing=TRUE),]
dprfs <- dprf[order(dprf$PROPDMGVAL, decreasing=TRUE),]
dcrfs <- dcrf[order(dcrf$CROPDMGVAL, decreasing=TRUE),]
# moving "other" to the last row
put_other_at_the_end <- function(df) {
    n = nrow(df)
    other = which(df$EVENT_TYPE=="other")
    if(other>1 & other<n)
        index = c(1:(other-1),(other+1):n,other)
    else if (other==1)
        index = c(2:n,other)
    else
        index = 1:n
    df[index,]
}
dftfs <- put_other_at_the_end(dftfs)
dinfs <- put_other_at_the_end(dinfs)
dprfs <- put_other_at_the_end(dprfs)
dcrfs <- put_other_at_the_end(dcrfs)

Let’s plot the most damageing event types for each damage indicator.

par(mfrow=c(2,2))
barplot_with_the_last_record <- function(df, col, n=3) {
    m = nrow(df)
    if(n==m)
        barplot(height = df[,col], names.arg = df$EVENT_TYPE, cex.names=0.7, main=col, ylab="Damage")
    else
        barplot(height = df[c(1:n,m),col], names.arg = df$EVENT_TYPE[c(1:n,m)], cex.names=0.7, main=col, ylab="Damage")
}
barplot_with_the_last_record(dftfs,"FATALITIES")
barplot_with_the_last_record(dinfs,"INJURIES")
barplot_with_the_last_record(dprfs,"PROPDMGVAL")
barplot_with_the_last_record(dcrfs,"CROPDMGVAL")

The plot shows the resulting ordering of event types. We see that the nonstandardized EVTYPE’s combined under “other” category can’t change the top ordering of event types.