This report, prepared for the Coursera class Reproducible Research, explores the NOAA Storm Database and uses it to determine which types of severe weather events are more costly from a human health and an economic perspective.
We found evidence that, from a total human health standpoint, the worst type of weather event (by far) is the tornado, which has cost over 650 lives and injured over 7000 people in the last 3 recorded years alone. From an economic standpoint, the most cosly events are floods and tornadoes, each one of which has caused the country over $10 billion in property and crops damage in the last 3 recorded years.
We are asked to show how to reproduce all the steps in the analysis, starting from the very raw data. First, I uncompressed and loaded the NOAA file as requested from the working directory. Then I examined its contents and structure, which are displayed here:
library(ggplot2)
library(reshape2)
#Open file and display contents
dat <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))
str(dat)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels ""," "," "," ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
This is where the data processing steps begin. We were asked to explain why we are doing any transformations, so:
#Unit conversions (to plain dollars)
dat$PROPDMG[grep('k|K',dat$PROPDMGEXP)] <- dat$PROPDMG[grep('k|K',dat$PROPDMGEXP)] * 10^3
dat$CROPDMG[grep('k|K',dat$CROPDMGEXP)] <- dat$CROPDMG[grep('k|K',dat$CROPDMGEXP)] * 10^3
dat$PROPDMG[grep('m|M',dat$PROPDMGEXP)] <- dat$PROPDMG[grep('m|M',dat$PROPDMGEXP)] * 10^6
dat$CROPDMG[grep('m|M',dat$CROPDMGEXP)] <- dat$CROPDMG[grep('m|M',dat$CROPDMGEXP)] * 10^6
dat$PROPDMG[grep('b|B',dat$PROPDMGEXP)] <- dat$PROPDMG[grep('b|B',dat$PROPDMGEXP)] * 10^9
dat$CROPDMG[grep('b|B',dat$CROPDMGEXP)] <- dat$CROPDMG[grep('b|B',dat$CROPDMGEXP)] * 10^9
dat$PROPDMG[grep("^[[:digit:]]+$", dat$PROPDMGEXP)] <-
dat$PROPDMG[grep("^[[:digit:]]+$", dat$PROPDMGEXP)] *
(10^as.numeric(as.character(dat$PROPDMGEXP[grep("^[[:digit:]]+$", dat$PROPDMGEXP)])))
dat$CROPDMG[grep("^[[:digit:]]+$", dat$CROPDMGEXP)] <-
dat$CROPDMG[grep("^[[:digit:]]+$", dat$CROPDMGEXP)] *
(10^as.numeric(as.character(dat$CROPDMGEXP[grep("^[[:digit:]]+$", dat$CROPDMGEXP)])))
#Calculate total damage (to property & crops) in millions of dollars
dat$DAMAGE <- (dat$PROPDMG + dat$CROPDMG) / (10^6)
#Format dates
dat$BGN_DATE <- as.Date(sub("^(.*?)\\s.*","\\1",as.character(dat$BGN_DATE)), format= "%m/%d/%Y")
dat$END_DATE <- as.Date(sub("^(.*?)\\s.*","\\1",as.character(dat$END_DATE)), format= "%m/%d/%Y")
#Check the changes
summary(dat[,c("BGN_DATE","END_DATE","DAMAGE")])
## BGN_DATE END_DATE DAMAGE
## Min. :1950-01-03 Min. :1986-04-10 Min. : 0
## 1st Qu.:1995-04-20 1st Qu.:2000-09-01 1st Qu.: 0
## Median :2002-03-18 Median :2005-04-30 Median : 0
## Mean :1998-12-27 Mean :2004-09-26 Mean : 1
## 3rd Qu.:2007-07-28 3rd Qu.:2008-08-10 3rd Qu.: 0
## Max. :2011-11-30 Max. :2011-11-30 Max. :115032
The events in the database start in the year 1950 and end in November 2011. I decided to keep only those records from the last three years for this analysis (since December 2008) because we know that the data quality is better in more recent times (and 3 years is a large enough sample.) I will also use this opportunity to discard the variables that we dont need anymore. I will keep the event dates, the event type, and the following target metrics for analysis:
#Reduce the dataset
dat <- dat[,c("BGN_DATE","END_DATE","EVTYPE","DAMAGE","FATALITIES","INJURIES")]
dat <- dat[dat$BGN_DATE>as.Date("12/01/2008", format= "%m/%d/%Y"),]
#Validate the results
summary(dat[,c("BGN_DATE","END_DATE")])
## BGN_DATE END_DATE
## Min. :2008-12-02 Min. :2008-12-02
## 1st Qu.:2009-08-14 1st Qu.:2009-08-14
## Median :2010-07-19 Median :2010-07-19
## Mean :2010-07-12 Mean :2010-07-12
## 3rd Qu.:2011-05-12 3rd Qu.:2011-05-12
## Max. :2011-11-30 Max. :2011-11-30
dim(dat)
## [1] 159214 6
Continuing with the pre-processing of the data in R, I summarized the metrics by type of event (field EVTYPE.) I calculated the total number of fatalities, total number of injuries, and total dollar amount in damages for each type of event during the study period. Also calculated the averages per event for these 3 metrics.
datSum <- aggregate( cbind(FATALITIES, INJURIES, DAMAGE) ~ EVTYPE, data=dat, FUN=sum)
datMean <- aggregate( cbind(FATALITIES, INJURIES, DAMAGE) ~ EVTYPE, data=dat, FUN=mean)
dim(datSum)
## [1] 47 4
After these transformation and aggregation steps, the data preparation work is complete. We now move to the exploratory data analysis phase of the project.
Our goal is to find the types of events that have the largest impact. Noticing the large number of events that exist in the database (looking at the summary counts above), it became clear that I need to focus on a few (the big ones.) To determine which ones, I created lists of “Top 10 Events” by various criteria (e.g. by total fatalities, by average number of fatalities per individual event, by total injuries, etc.) This “top 10 list” will give us an initial idea of which events are more costly. Here are all the lists:
#Define function to sort events and pick the top ones given the selection criteria
getTopNEvents <- function (data, target, N=10, decreasing=T) {
data[order(data[,target], decreasing=decreasing),][1:N,]
}
#Top 10 list ranked by the average number of fatalities per event
getTopNEvents(datMean, "FATALITIES")
## EVTYPE FATALITIES INJURIES DAMAGE
## 43 TSUNAMI 2.06250 8.062500 8.430e+00
## 34 RIP CURRENT 0.72222 0.567901 6.173e-06
## 2 AVALANCHE 0.48000 0.390000 1.273e-02
## 19 HEAT 0.27883 1.392034 3.375e-03
## 5 COLD/WIND CHILL 0.21239 0.005900 5.897e-03
## 11 EXCESSIVE HEAT 0.20147 0.523342 3.030e-03
## 22 HIGH SURF 0.17842 0.248963 2.730e-01
## 32 MARINE STRONG WIND 0.16129 0.225806 1.204e-02
## 40 TORNADO 0.13171 1.465510 2.342e+00
## 12 EXTREME COLD/WIND CHILL 0.04688 0.002232 1.570e-02
#Top 10 list ranked by the total number of fatalities in the last 3 years
getTopNEvents(datSum, "FATALITIES")
## EVTYPE FATALITIES INJURIES DAMAGE
## 40 TORNADO 653 7266 11611.233
## 13 FLASH FLOOD 168 239 2918.843
## 19 HEAT 133 664 1.610
## 14 FLOOD 122 147 12713.085
## 34 RIP CURRENT 117 92 0.001
## 29 LIGHTNING 90 577 164.247
## 39 THUNDERSTORM WIND 89 892 2183.088
## 11 EXCESSIVE HEAT 82 213 1.233
## 5 COLD/WIND CHILL 72 2 1.999
## 2 AVALANCHE 48 39 1.273
It stands out that the worst weather event by average fatalities (the TSUNAMI) is not even in the second list by total fatalities. Although tsunamies seem to be the most dangerous events when they occur, they happen with very low frequency. When we look at the total number of fatalities over the last 3 years, we see that the TORNADO has cost us many more lives. TORNADOS are in fact the biggest threat among all these weather events followed by FLASH FLOODS. In third place we have HEAT which happens to be in the first list too.
Now, we'll look at injuries:
#Top 10 list ranked by the average number of injuries per event
getTopNEvents(datMean, "INJURIES")
## EVTYPE FATALITIES INJURIES DAMAGE
## 43 TSUNAMI 2.06250 8.0625 8.430e+00
## 40 TORNADO 0.13171 1.4655 2.342e+00
## 19 HEAT 0.27883 1.3920 3.375e-03
## 34 RIP CURRENT 0.72222 0.5679 6.173e-06
## 11 EXCESSIVE HEAT 0.20147 0.5233 3.030e-03
## 2 AVALANCHE 0.48000 0.3900 1.273e-02
## 9 DUST DEVIL 0.00000 0.3333 9.715e-03
## 22 HIGH SURF 0.17842 0.2490 2.730e-01
## 29 LIGHTNING 0.03756 0.2408 6.855e-02
## 32 MARINE STRONG WIND 0.16129 0.2258 1.204e-02
#Top 10 list ranked by the total number of injuries in the last 3 years
getTopNEvents(datSum, "INJURIES")
## EVTYPE FATALITIES INJURIES DAMAGE
## 40 TORNADO 653 7266 11611.233
## 39 THUNDERSTORM WIND 89 892 2183.088
## 19 HEAT 133 664 1.610
## 29 LIGHTNING 90 577 164.247
## 13 FLASH FLOOD 168 239 2918.843
## 11 EXCESSIVE HEAT 82 213 1.233
## 45 WILDFIRE 10 208 998.956
## 14 FLOOD 122 147 12713.085
## 18 HAIL 0 143 5830.775
## 43 TSUNAMI 33 129 134.879
One can notice similarities between the fatality lists and the injury lists. TSUNAMIES are again at the top when ranking by average injuries… and it barely made it to the list prioritizing by total injuries in the last 3 years. Interestinly, TORNADOS remain at the top of the total injury list and now take the second place in the list by average industries. So, tornados continue to rank as the worst threat to human life and health.
Now, we'll look at economic damage aspect:
#Top 10 list ranked by the average damage amount per event
getTopNEvents(datMean, "DAMAGE")
## EVTYPE FATALITIES INJURIES DAMAGE
## 43 TSUNAMI 2.062500 8.062500 8.4299
## 24 HURRICANE 0.000000 0.200000 4.2060
## 40 TORNADO 0.131706 1.465510 2.3419
## 14 FLOOD 0.014592 0.017582 1.5205
## 42 TROPICAL STORM 0.024752 0.004950 0.8969
## 16 FROST/FREEZE 0.000000 0.000000 0.8346
## 45 WILDFIRE 0.008137 0.169243 0.8128
## 25 ICE STORM 0.007160 0.009547 0.7278
## 37 STORM SURGE/TIDE 0.000000 0.000000 0.5814
## 28 LANDSLIDE 0.004926 0.044335 0.4061
#Top 10 list ranked by the total damage amount in the last 3 years
getTopNEvents(datSum, "DAMAGE")
## EVTYPE FATALITIES INJURIES DAMAGE
## 14 FLOOD 122 147 12713.1
## 40 TORNADO 653 7266 11611.2
## 18 HAIL 0 143 5830.8
## 13 FLASH FLOOD 168 239 2918.8
## 39 THUNDERSTORM WIND 89 892 2183.1
## 45 WILDFIRE 10 208 999.0
## 16 FROST/FREEZE 0 0 441.5
## 25 ICE STORM 3 4 305.0
## 23 HIGH WIND 15 51 267.8
## 42 TROPICAL STORM 5 1 181.2
TSUNAMIES and HURACANES cause a lot of damage when they occur, but aren't as common as other events. When we account for the frequency, FLOODS and TORNADOS turn out to cost us more than 1500x the amount of tsunamies and huracanes.
After reviewing these lists, we are ready to present the results.
The first question that we need to answer is: “Across the United States, which types of events are most harmful with respect to population health?”
The analysis doesn't need to be complicated. In order to answer this question, I ploted the worst 3 events by total number of fatalities and the worst 3 by total injuries altogether. Here are the results: The event that stands out the most from a human health impact perspective is the TORNADO.
#Generate final list of events (top 3) for the "human damage"" metrics
datH <- c(as.character(getTopNEvents(datSum, "FATALITIES", N=3)[,"EVTYPE"]),
as.character(getTopNEvents(datSum, "INJURIES", N=3)[,"EVTYPE"])
)
datH <- data.frame(EVTYPE=unique(datH))
#Add back all the attributes to the event IDs
datH <- merge(datH, datSum, by="EVTYPE")
datH <- melt(datH[,c("EVTYPE","FATALITIES","INJURIES")], id=c("EVTYPE"))
#PLot the results
qplot(x=EVTYPE, y=value, fill=variable, data=datH,
geom="bar", stat="identity",position="dodge",
xlab="Type of Weather Event",
main="Worst Events by Human Impact")
The second question that we need to answer is: “Across the United States, which types of events have the greatest economic consequences?”
In order to answer this question, I plotted the worst events by total property damage and crop damage combined. Here are the results: The events that stand out the most from an economic impact standpoint are FLOODS and TORNADOES.
#Generate final list of events (top 3) for the "economic damage"" metrics
datH <- c(as.character(getTopNEvents(datSum, "DAMAGE", N=5)[,"EVTYPE"]))
datH <- data.frame(EVTYPE=datH)
#Add back all the attributes to the event IDs
datH <- merge(datH, datSum, by="EVTYPE")
datH <- melt(datH[,c("EVTYPE","DAMAGE")], id=c("EVTYPE"))
#PLot the results
qplot(x=EVTYPE, y=value, fill=variable, data=datH,
geom="bar", stat="identity",
xlab="Type of Weather Event",
ylab="Total Property and Crop Damage (in $M)",
main="Worst Events by Economic Impact")
July 24, 2014