This analysis identifies those weather events since 1993 that have, on average caused the most damage in four categories: Fatalities, Injuries, Property Damage, and Crop Damage. The analysis uses National Climate Date Center Storm Events data since 1993 (earlier years are available, but as shown below, are of inconsistent completeness until 1993). All calculations combine the impacts for the entire US (including outlying territories), and hence might be of limited utility for government managers concerned with a particular state or county. However, the code shown here could be easily modified to restrict the data to a single state, county, or territory.
As is common in such analyses, much of the code is devoted to preparing the data for analysis. Although the analysis looked at the mean impacts across all occurrences of event types, the highest averages for all four categories resulted from single events. Ultimately, we find a single Heat Wave caused the largest number of injuries, Tornadoes, Thunderstorm Wind, and Hail caused the most deaths and most extensive property damage.
The first order of business is to read in the data (rather time consuming for a data frame with nearly 1 million rows). I then select out the columns to be used in further data preparation.
setwd("~/Dropbox/In process/MyRWork")
library(dplyr)
storms <- read.csv("repdata-data-StormData.csv")
mydata <- select(storms, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
With such a large data table, I expected issues with high dimensionality. Therefore before going further, we examine several variables. In particular, I wanted to know the number and frequency of the different types of weather events, as well as a sense of the distributions of the 4 impact-related variables.
As we see below, there are 985 different types of weather events; such a large number creates a challenge for visual summaries. Below I created a Pareto chart to see which events were most common.
str(mydata$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
# explore the EVTYPE for issues
t <- sort(table(mydata$EVTYPE), decreasing = TRUE)
head(t,20)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
th<- head(t,25)
library(qcc)
pc <- pareto.chart(th, main="Frequency of Weather Events")
summary(mydata$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(mydata$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
table(mydata$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(mydata$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Based on the assignment instructions, exploration of the data, and discussions in the Course discussion list, I wanted to be able to examine the data by year (to be able to subset the original data to include only years with standardized and reliable data).
Additionally, the economic damage data needs preprocessing because the dollar amounts vary in order of magnitude. For example some amounts are in dollars, while other are in thousands, millions, or billions. Still other amounts are of unclear magnitude.
library(stringr)
# not elegant but it works...
d<- mydata$BGN_DATE
l <-str_locate(d," ")-4
dt <-str_sub(d,l[,1],l[,1]+3)
mydata <- mutate(mydata, year=dt)
Next, be sure that the dollar amount scale codes have no embedded blanks.
mydata$PROPDMGEXP <- str_trim(mydata$PROPDMGEXP, side="both")
mydata$CROPDMGEXP <- str_trim(mydata$CROPDMGEXP, side="both")
As noted above, there was reason to think that early-years data might be of variable quality. Two quick graphs reveal that there were few non-zero entries before 1993.
plot(mydata$year, mydata$PROPDMG,
main="Property Damage Figures by year")
plot(mydata$year, mydata$CROPDMG,
main="Crop Damage Figures by year")
Based on these graphs, I then chose to work only with data from 1993 onwards.
workdf <- filter(mydata, year > "1992")
The next step in the data preparation is to use the codes in the “DMGEXP” variables to re-express economic damage in dollar units. In other words, to convert a number like 2M to $2,000,000. To do this, I used a multiplier table as suggeted in the assignment discussion. To determine the codes needed for the table, I tabulated the codes for both Property and Crop damage.
table(workdf$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 313139 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 392674 7 8557
table(workdf$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 430854 7 19 1 9 21 281832 1 1994
Based on these lists, I established a conversion table for each code used, and then create 2 new variables for the costs to property and crops.
mt <- c("0" = 1, "K" = 1000, "k"=1000,
"M" = 1000000, "m" = 1000000, "B" = 1000000000,
" "=1, "-"= 1,"?"=1, "+"=1, "1"=1, "2"=1, "3"=1, "4"=1,
"5"=1, "6"=1, "7"=1, "8"=1, "h"=1, "H"=1)
workdf <- mutate(workdf, propcost=PROPDMG*mt[workdf$PROPDMGEXP])
workdf <- mutate(workdf, cropcost=CROPDMG*mt[workdf$CROPDMGEXP])
With all of the cleaning steps completed, we now create a new working data frame containing the mean damages caused by each of the Event Types. There were several reasonable approaches to the definition of “most harmful” and “greatest economic consequences”. After due consideration, I decided to use the mean, rather than median for each event type, or the maximum harm caused by any single weather event.
by_ev <- group_by(workdf, EVTYPE)
m_harm <- summarize(by_ev,
mfatal = mean(FATALITIES),
minjur = mean(INJURIES),
mprop = mean(propcost),
mcrop = mean(cropcost),
events = n())
With all of the mean values computed, I decided to “slim down” the data frames yet again by filtering out any event type that caused no harm on average over all of the years observed, and then sort four subsets by the extent of the damage done.
econ_harm <- filter(m_harm, (mprop+mcrop > 0))
human_harm <- filter(m_harm, (mfatal + minjur) > 0)
# sort by magnitude of damage
attach(human_harm)
injury <- human_harm[order(-minjur),]
death <- human_harm[order(-mfatal),]
detach(human_harm)
attach(econ_harm)
property <- econ_harm[order(-mprop),]
crop <- econ_harm[order(-mcrop),]
detach(econ_harm)
Below are four tables listing the 5 event types doing the most harm in each of the four categories of injuries, fatalities, property damage and crop damage.
i <- head(select(injury, EVTYPE, minjur, events))
d <- head(select(death, EVTYPE, mfatal, events))
p <- head(select(property, EVTYPE, mprop, events))
c <- head(select(crop, EVTYPE, mcrop, events))
library(xtable)
## Warning: package 'xtable' was built under R version 3.1.3
xti <- xtable(i, caption="Top 5 most damaging weather events: Injuries",
digits = 0 )
print(xti, type="html")
| EVTYPE | minjur | events | |
|---|---|---|---|
| 1 | Heat Wave | 70 | 1 |
| 2 | TROPICAL STORM GORDON | 43 | 1 |
| 3 | WILD FIRES | 38 | 4 |
| 4 | THUNDERSTORMW | 27 | 1 |
| 5 | HIGH WIND AND SEAS | 20 | 1 |
| 6 | SNOW/HIGH WINDS | 18 | 2 |
xtd <- xtable(d,
caption="Top 5 most damaging weather events: Fatalities",
digits = 1 )
print(xtd, type="html")
| EVTYPE | mfatal | events | |
|---|---|---|---|
| 1 | TORNADOES, TSTM WIND, HAIL | 25.0 | 1 |
| 2 | COLD AND SNOW | 14.0 | 1 |
| 3 | TROPICAL STORM GORDON | 8.0 | 1 |
| 4 | RECORD/EXCESSIVE HEAT | 5.7 | 3 |
| 5 | EXTREME HEAT | 4.4 | 22 |
| 6 | HEAT WAVE DROUGHT | 4.0 | 1 |
xtp <- xtable(p,
caption="Top 5 most damaging weather events: Property Damage",
digits = 1 )
print(xtp, type="html")
| EVTYPE | mprop | events | |
|---|---|---|---|
| 1 | TORNADOES, TSTM WIND, HAIL | 1600000000.0 | 1 |
| 2 | HURRICANE OPAL/HIGH WINDS | 100000000.0 | 1 |
| 3 | WINTER STORM HIGH WINDS | 60000000.0 | 1 |
| 4 | HIGH WINDS/COLD | 22100000.0 | 5 |
| 5 | Heavy Rain/High Surf | 13500000.0 | 1 |
| 6 | HIGH WINDS HEAVY RAINS | 7500000.0 | 1 |
xtc <- xtable(c,
caption="Top 5 most damaging weather events: Crop Damage",
digits = 1 )
print(xtc, type="html")
| EVTYPE | mcrop | events | |
|---|---|---|---|
| 1 | HURRICANE OPAL/HIGH WINDS | 10000000.0 | 1 |
| 2 | WINTER STORM HIGH WINDS | 5000000.0 | 1 |
| 3 | TORNADOES, TSTM WIND, HAIL | 2500000.0 | 1 |
| 4 | Heavy Rain/High Surf | 1500000.0 | 1 |
| 5 | HIGH WINDS/COLD | 1400000.0 | 5 |
| 6 | DUST STORM/HIGH WINDS | 500000.0 | 1 |
This analysis illustrates the value of “deep dives” into data. While it is very important to understand, say, the risks of hurricane-force winds, there are certainly parts that never see a hurricane. Additionally, risks are seasonal so that a government official (or private business owner) will have different concerns in different months.
One logical next phase of the analysis would be to create an interactive tool that would allow for user-directed filtering by location, season, and so on.