Synopsis

This report describes an exploratory analysis of weather events documented in NOAA storm data from 1950-2011. Specifically, we ask which types of events are most harmful to human health and which have the greatest economic impact. The analysis presented here only uses data from 1990-2011 because records prior to 1990 are sparse. Although there are 985 separate weather event types, this report only looks at the 15 event types with the highest totals for each outcome (fatalities, injuries, and combined crop and property damage). The findings indicate that floods cause the most crop and property damage, tornadoes cause the most injuries, and deaths are most attributable to excessive heat and tornadoes.

Data Processing

Initially, we download and inspect the data.

require(downloader); require(ggplot2); require(stats)
## Loading required package: downloader
## Loading required package: ggplot2
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download(data_url, destfile = "StormData.csv.bz2", mode = "wb")
rawdata <- read.csv(bzfile("StormData.csv.bz2"), header = TRUE, stringsAsFactors = FALSE)
str(rawdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Next we prepare the data for analysis with some light pre-processing. First we limit the analysis to the variables of interest (N.B: this is a large dataset, so this may take a minute or two). Next, we reformat the date variable, keeping only “years” for our rough-grained time analysis.

Having been forewarned about the paucity of data in the early years of this data set, we inspected a histogram of event years (not shown), which indicates comparatively sparse data prior to 1990. We therefore subset the data so that the rest of the analysis only pertains to data from 1990 onward.

subset.rawdata <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- rawdata[subset.rawdata]
data$years = as.numeric(format(as.Date(data$BGN_DATE, format="%m/%d/%Y %H:%M:%S"), "%Y"))
data$BGN_DATE <- NULL
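The histogram of event years is not shown in this report. As a rough sketch of that kind of check, the snippet below tabulates records per year for a few made-up dates in the same layout as BGN_DATE. The vector bgn_date and its values are illustrative assumptions, not rows from the storm data.

```r
# Made-up dates in the same "%m/%d/%Y %H:%M:%S" layout as BGN_DATE
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "6/8/1990 0:00:00", "7/4/1990 0:00:00", "1/15/2011 0:00:00")

# Same year extraction as used for data$years above
years <- as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

# Number of records per year, and the share of records from 1990 onward
year_counts <- table(years)
share_from_1990 <- mean(years >= 1990)

print(year_counts)
print(share_from_1990)
```

On the real data, the same two summaries could be computed from data$years before deciding on a cutoff.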

Noting that crop and property damage are each recorded as a pair of variables (an amount and a magnitude code), we consult the magnitude codes described on page 27 of the National Weather Service Storm Data Documentation.

We reformat these variables so that they can be used to calculate crop and property damage by weather event, combining them into one “econ” variable that we then plot (not shown) to examine the distribution of damage estimates.

data.sub <- data[data$years >= 1990, ]
unique(data.sub$PROPDMGEXP)
##  [1] ""  "M" "K" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data.sub$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
str(data.sub)
## 'data.frame':    751740 obs. of  8 variables:
##  $ EVTYPE    : chr  "HAIL" "TSTM WIND" "TSTM WIND" "TSTM WIND" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 28 0 0 0 0 0 ...
##  $ PROPDMG   : num  0 0 0 0 2.5 0 0 0 25 25 ...
##  $ PROPDMGEXP: chr  "" "" "" "" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ years     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
# recode EXP variables
data.sub$CROPDMGEXP[toupper(data.sub$CROPDMGEXP) == 'B'] = "1000000000"
data.sub$CROPDMGEXP[toupper(data.sub$CROPDMGEXP) == 'M'] = "1000000"
data.sub$CROPDMGEXP[toupper(data.sub$CROPDMGEXP) == 'K'] = "1000"
data.sub$CROPDMGEXP[toupper(data.sub$CROPDMGEXP) == 'H'] = "100"
data.sub$CROPDMGEXP = as.numeric(data.sub$CROPDMGEXP)
## Warning: NAs introduced by coercion
data.sub$CROPDMGEXP[is.na(data.sub$CROPDMGEXP)] = 0
data.sub$cropdam = data.sub$CROPDMG * data.sub$CROPDMGEXP

data.sub$PROPDMGEXP[toupper(data.sub$PROPDMGEXP) == 'B'] = "1000000000"
data.sub$PROPDMGEXP[toupper(data.sub$PROPDMGEXP) == 'M'] = "1000000"
data.sub$PROPDMGEXP[toupper(data.sub$PROPDMGEXP) == 'K'] = "1000"
data.sub$PROPDMGEXP[toupper(data.sub$PROPDMGEXP) == 'H'] = "100"
data.sub$PROPDMGEXP = as.numeric(data.sub$PROPDMGEXP)
## Warning: NAs introduced by coercion
data.sub$PROPDMGEXP[is.na(data.sub$PROPDMGEXP)] = 0
data.sub$propdam = data.sub$PROPDMG * data.sub$PROPDMGEXP

data.sub$econ <- data.sub$propdam + data.sub$cropdam
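The recoding above repeats the same four assignments for each damage variable. A minimal sketch of the same mapping as a lookup table follows. The helper name exp_to_multiplier and the toy data frame are hypothetical. The sketch only mirrors what the code above does (letters H, K, M, B in either case become powers of ten, digit codes keep their numeric value, everything else becomes 0). It is not an official interpretation of the codes.

```r
# Hypothetical helper: convert magnitude codes to numeric multipliers,
# mirroring the recoding used in this report.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Case-insensitive lookup; codes not in the table come back as NA
  mult <- unname(lookup[toupper(code)])
  # Digit codes such as "5" keep their numeric value, as in the code above
  numeric_code <- suppressWarnings(as.numeric(code))
  mult[is.na(mult)] <- numeric_code[is.na(mult)]
  # Anything still unrecognised ("", "?", "+", "-") becomes 0
  mult[is.na(mult)] <- 0
  mult
}

# Toy example (made-up values, not from the storm data)
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3),
                  PROPDMGEXP = c("K", "m", "B", "?"),
                  stringsAsFactors = FALSE)
toy$propdam <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

Keeping the mapping in one function means the crop and property columns are guaranteed to be treated identically.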

hist2 <- ggplot(data.sub, aes(x = econ)) + geom_histogram(colour = "blue", fill = "lightgreen") + scale_x_log10() + ylab("Frequency") + xlab("Damages in Dollars")
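Because the histogram uses a log10 x-axis, events with zero recorded damage cannot be placed on it: log10(0) is -Inf, and ggplot2 drops such rows with a warning. As a small sketch with made-up values (the vector econ below is illustrative, not the real data.sub$econ), the zeros can be counted and filtered out before plotting:

```r
# Made-up damage totals in dollars; zeros stand for events with no recorded damage
econ <- c(0, 2500, 25000, 0, 1e6)

# log10 of zero is -Inf, which a log-scaled axis cannot display
zero_is_infinite <- is.infinite(log10(0))

# Count the zero-damage events, then keep only positive values for the log-scale plot
n_zero <- sum(econ == 0)
econ_positive <- econ[econ > 0]

print(n_zero)
print(log10(econ_positive))
```

Reporting the number of dropped zeros alongside the plot makes it clear how much of the data the histogram actually represents.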

In order to prepare the data for our analysis, we aggregate the data according to the variables relevant to the first question (fatalities and injuries) and the newly created variable for the second question (crop and property damage). Finally, seeing that there are 985 event types and wanting to constrain our analysis, we limit ourselves to the 15 event types with the highest totals for each of fatalities, injuries and economic damage.

fatalities <- aggregate(FATALITIES ~ EVTYPE, data = data.sub, FUN = sum)
fatalities_sub <- fatalities[order(-fatalities$FATALITIES), ][1:15, ]
injuries <- aggregate(INJURIES ~ EVTYPE, data = data.sub, FUN = sum)
injuries_sub <- injuries[order(-injuries$INJURIES), ][1:15, ]
econdamage <- aggregate(econ ~ EVTYPE, data = data.sub, FUN = sum)
econdamage_sub <- econdamage[order(-econdamage$econ), ][1:15, ]
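The aggregate-then-rank pattern above can be illustrated on a small example. The sketch below uses a made-up data frame (toy, with invented event counts) and keeps the top 2 rather than the top 15. It uses head() instead of [1:15, ], which is an alternative to the indexing used above rather than the method of this report.

```r
# Made-up events (not from the storm data)
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 1, 6),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort by descending total and keep the two largest
top2 <- head(totals[order(-totals$FATALITIES), ], 2)
print(top2)
```

One reason to prefer head(x, n) is that it returns at most the available rows, whereas indexing with [1:n, ] pads the result with NA rows when fewer than n event types exist.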

Results

Plotting fatalities by event type, we see that excessive heat and tornadoes are the leading factors in weather-related deaths. Flash floods and heat are smaller contributors.

# Bar chart of total fatalities for the 15 deadliest event types
fatalities_sub.plot <- ggplot(data = fatalities_sub, aes(x = EVTYPE, y = FATALITIES)) +
                              geom_bar(colour = "blue", fill = "lightgreen", width = .7, stat = "identity") +
                              scale_y_continuous("Fatalities") +
                              theme(axis.text.x = element_text(angle = 50, size = 8, hjust = .9)) +
                              xlab("Event Type") +
                              ggtitle("Fatalities by Event Type (1990-2011)")
plot(fatalities_sub.plot)
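Because EVTYPE is a character column, ggplot2 places the bars in alphabetical order rather than by size. If bars sorted from largest to smallest are preferred, one option is to turn EVTYPE into a factor whose levels follow the totals. The sketch below shows only the reordering step, in base R, on a made-up data frame (toy). It is a suggestion, not something done in this report.

```r
# Made-up totals (not from the storm data)
toy <- data.frame(EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
                  FATALITIES = c(2, 6, 5),
                  stringsAsFactors = FALSE)

# reorder() sorts factor levels by increasing value of the second argument,
# so negating the totals puts the largest total first
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$FATALITIES)

print(levels(toy$EVTYPE))
```

Using the reordered factor as the x aesthetic makes the discrete axis follow the level order, so the tallest bar appears first.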

Plotting injuries by event type, it is evident that tornadoes cause far more injuries than any other weather event type. Excessive heat and floods are lesser contributors.

# Bar chart of total injuries for the 15 most injurious event types
injuries_sub.plot <- ggplot(data = injuries_sub, aes(x = EVTYPE, y = INJURIES)) +
                              geom_bar(colour = "blue", fill = "lightgreen", width = .7, stat = "identity") +
                              scale_y_continuous("Injuries") +
                              theme(axis.text.x = element_text(angle = 50, size = 8, hjust = .9)) +
                              xlab("Event Type") +
                              ggtitle("Injuries by Event Type (1990-2011)")
plot(injuries_sub.plot)

When we plot crop and property damage by event type, we see that floods have the worst economic consequences by far. Hurricanes / typhoons, storm surge and tornadoes are smaller contributors.

# Bar chart of combined crop and property damage (in billions of dollars) for the 15 costliest event types
econdamage_sub.plot <- ggplot(data = econdamage_sub, aes(x = EVTYPE, y = econ/10^9)) +
                              geom_bar(colour = "blue", fill = "lightgreen", width = .7, stat = "identity") +
                              scale_y_continuous("Damage in Dollars (Billions)") +
                              theme(axis.text.x = element_text(angle = 50, size = 8, hjust = .9)) +
                              xlab("Event Type") +
                              ggtitle("Crop and Property Damage by Event Type (1990-2011)")
plot(econdamage_sub.plot)

Thank you for your time and attention!