Synopsis

This analysis identifies those weather events since 1993 that have, on average caused the most damage in four categories: Fatalities, Injuries, Property Damage, and Crop Damage. The analysis uses National Climate Date Center Storm Events data since 1993 (earlier years are available, but as shown below, are of inconsistent completeness until 1993). All calculations combine the impacts for the entire US (including outlying territories), and hence might be of limited utility for government managers concerned with a particular state or county. However, the code shown here could be easily modified to restrict the data to a single state, county, or territory.

As is common in such analyses, much of the code is devoted to preparing the data for analysis. Although the analysis looked at the mean impacts across all occurrences of event types, the highest averages for all four categories resulted from single events. Ultimately, we find a single Heat Wave caused the largest number of injuries, Tornadoes, Thunderstorm Wind, and Hail caused the most deaths and most extensive property damage.

Data Processing

The first order of business is to read in the data (rather time consuming for a data frame with nearly 1 million rows). I then select out the columns to be used in further data preparation.

setwd("~/Dropbox/In process/MyRWork")

library(dplyr)

storms <- read.csv("repdata-data-StormData.csv")
mydata <- select(storms, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
                 PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

With such a large data table, I expected issues with high dimensionality. Therefore before going further, we examine several variables. In particular, I wanted to know the number and frequency of the different types of weather events, as well as a sense of the distributions of the 4 impact-related variables.

As we see below, there are 985 different types of weather events; such a large number creates a challenge for visual summaries. Below I created a Pareto chart to see which events were most common.

str(mydata$EVTYPE)
##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
# explore the EVTYPE for issues
t <- sort(table(mydata$EVTYPE), decreasing = TRUE)
head(t,20)
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761
th<- head(t,25)

library(qcc)
pc <- pareto.chart(th, main="Frequency of Weather Events")

summary(mydata$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(mydata$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
table(mydata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(mydata$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Based on the assignment instructions, exploration of the data, and discussions in the Course discussion list, I wanted to be able to examine the data by year (to be able to subset the original data to include only years with standardized and reliable data).

Additionally, the economic damage data needs preprocessing because the dollar amounts vary in order of magnitude. For example some amounts are in dollars, while other are in thousands, millions, or billions. Still other amounts are of unclear magnitude.

First create a new variable for the year of the event

library(stringr)
# not elegant but it works...
d<- mydata$BGN_DATE
l <-str_locate(d," ")-4
dt <-str_sub(d,l[,1],l[,1]+3)
mydata <- mutate(mydata, year=dt)

Preliminary step for conversion of dollar damage amounts

Next, be sure that the dollar amount scale codes have no embedded blanks.

mydata$PROPDMGEXP <- str_trim(mydata$PROPDMGEXP, side="both")
mydata$CROPDMGEXP <- str_trim(mydata$CROPDMGEXP, side="both")

investigate patterns of damage costs to find reasonable starting year

As noted above, there was reason to think that early-years data might be of variable quality. Two quick graphs reveal that there were few non-zero entries before 1993.

plot(mydata$year, mydata$PROPDMG, 
     main="Property Damage Figures by year")

plot(mydata$year, mydata$CROPDMG,
     main="Crop Damage Figures by year")

Based on these graphs, I then chose to work only with data from 1993 onwards.

workdf <- filter(mydata, year > "1992")

The next step in the data preparation is to use the codes in the “DMGEXP” variables to re-express economic damage in dollar units. In other words, to convert a number like 2M to $2,000,000. To do this, I used a multiplier table as suggeted in the assignment discussion. To determine the codes needed for the table, I tabulated the codes for both Property and Crop damage.

table(workdf$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 313139      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 392674      7   8557
table(workdf$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 430854      7     19      1      9     21 281832      1   1994

Based on these lists, I established a conversion table for each code used, and then create 2 new variables for the costs to property and crops.

mt <- c("0" = 1, "K" = 1000, "k"=1000,
        "M" = 1000000, "m" = 1000000, "B" = 1000000000,
        " "=1, "-"= 1,"?"=1, "+"=1, "1"=1, "2"=1, "3"=1, "4"=1,
        "5"=1, "6"=1, "7"=1, "8"=1, "h"=1, "H"=1)
workdf <- mutate(workdf, propcost=PROPDMG*mt[workdf$PROPDMGEXP])
workdf <- mutate(workdf, cropcost=CROPDMG*mt[workdf$CROPDMGEXP])

With all of the cleaning steps completed, we now create a new working data frame containing the mean damages caused by each of the Event Types. There were several reasonable approaches to the definition of “most harmful” and “greatest economic consequences”. After due consideration, I decided to use the mean, rather than median for each event type, or the maximum harm caused by any single weather event.

by_ev <- group_by(workdf, EVTYPE)
m_harm <- summarize(by_ev,
        mfatal = mean(FATALITIES),
        minjur = mean(INJURIES),
        mprop = mean(propcost),
        mcrop = mean(cropcost),
        events = n())

More data reduction

With all of the mean values computed, I decided to “slim down” the data frames yet again by filtering out any event type that caused no harm on average over all of the years observed, and then sort four subsets by the extent of the damage done.

econ_harm <- filter(m_harm, (mprop+mcrop > 0))
human_harm <- filter(m_harm, (mfatal + minjur) > 0)

# sort by magnitude of damage
attach(human_harm)
injury <- human_harm[order(-minjur),]
death <- human_harm[order(-mfatal),]
detach(human_harm)

attach(econ_harm)
property <- econ_harm[order(-mprop),]
crop <- econ_harm[order(-mcrop),]
detach(econ_harm)

Results

Below are four tables listing the 5 event types doing the most harm in each of the four categories of injuries, fatalities, property damage and crop damage.

i <- head(select(injury, EVTYPE, minjur, events))
d <- head(select(death, EVTYPE, mfatal, events))

p <- head(select(property, EVTYPE, mprop, events))
c <- head(select(crop, EVTYPE, mcrop, events))


library(xtable)
## Warning: package 'xtable' was built under R version 3.1.3
xti <- xtable(i, caption="Top 5 most damaging weather events: Injuries",
        digits = 0 )
print(xti, type="html")
Top 5 most damaging weather events: Injuries
EVTYPE minjur events
1 Heat Wave 70 1
2 TROPICAL STORM GORDON 43 1
3 WILD FIRES 38 4
4 THUNDERSTORMW 27 1
5 HIGH WIND AND SEAS 20 1
6 SNOW/HIGH WINDS 18 2
xtd <- xtable(d, 
        caption="Top 5 most damaging weather events: Fatalities", 
        digits = 1 )
print(xtd, type="html")
Top 5 most damaging weather events: Fatalities
EVTYPE mfatal events
1 TORNADOES, TSTM WIND, HAIL 25.0 1
2 COLD AND SNOW 14.0 1
3 TROPICAL STORM GORDON 8.0 1
4 RECORD/EXCESSIVE HEAT 5.7 3
5 EXTREME HEAT 4.4 22
6 HEAT WAVE DROUGHT 4.0 1
xtp <- xtable(p, 
        caption="Top 5 most damaging weather events: Property Damage",
        digits = 1 )
print(xtp, type="html")
Top 5 most damaging weather events: Property Damage
EVTYPE mprop events
1 TORNADOES, TSTM WIND, HAIL 1600000000.0 1
2 HURRICANE OPAL/HIGH WINDS 100000000.0 1
3 WINTER STORM HIGH WINDS 60000000.0 1
4 HIGH WINDS/COLD 22100000.0 5
5 Heavy Rain/High Surf 13500000.0 1
6 HIGH WINDS HEAVY RAINS 7500000.0 1
xtc <- xtable(c, 
        caption="Top 5 most damaging weather events: Crop Damage",
        digits = 1 )
print(xtc, type="html")
Top 5 most damaging weather events: Crop Damage
EVTYPE mcrop events
1 HURRICANE OPAL/HIGH WINDS 10000000.0 1
2 WINTER STORM HIGH WINDS 5000000.0 1
3 TORNADOES, TSTM WIND, HAIL 2500000.0 1
4 Heavy Rain/High Surf 1500000.0 1
5 HIGH WINDS/COLD 1400000.0 5
6 DUST STORM/HIGH WINDS 500000.0 1

Closing Thoughts

This analysis illustrates the value of “deep dives” into data. While it is very important to understand, say, the risks of hurricane-force winds, there are certainly parts that never see a hurricane. Additionally, risks are seasonal so that a government official (or private business owner) will have different concerns in different months.

One logical next phase of the analysis would be to create an interactive tool that would allow for user-directed filtering by location, season, and so on.