The following data analysis addresses two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This report is intended as an aid for municipal managers and other government officials who need to prioritize resources for weather-related events. Though we make no specific recommendations, we are able to tease out of the NOAA Storm Database those events that have the largest impact on health and the economy. Using simple graphs, tables and summaries, the reader will find that our results point to a clear primary source of concern. The report is fully reproducible: it contains all of the code used in the analysis, along with a description of the reasoning behind each step. The results will show that the types of events most harmful with respect to health are primarily wind related. Conversely, the events with the greatest economic consequences revolve around the coastal areas and are primarily water related.
The data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be found at the following URL: Data
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
After downloading the data to the working directory for R, we use a single command to load and parse it from the raw data file repdata-data-StormData.csv.bz2.
# load the data
stormData <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
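As a small illustration of why a single command is enough here, the sketch below writes a toy CSV through a bzip2 connection and reads it back the same way. The `toy` data frame, its values and the temporary file are hypothetical stand-ins and are not part of the storm data.

```r
# Toy stand-in for the storm data (made-up values, for illustration only)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(2, 0))

# Write the CSV through a bzip2-compressing connection to a temporary file
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() accepts the bzfile() connection and decompresses while reading
roundTrip <- read.csv(bzfile(path))
dim(roundTrip)
```

Because the decompression happens inside the connection, no separate unpacking step is needed before parsing.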
To process the data we use a different technique for each question; each technique starts from the stormData variable.
The resulting data.frame contains 902,297 observations of 37 variables.
Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
To answer this question we use the FATALITIES and INJURIES columns. Though we could have simply summed them together (and ultimately we want to see them together), we decided to keep them separate and use a stacked barplot to show their impact separately for each of the most harmful events.
The first step is to aggregate the data into a new data.frame, which we'll call health. To start, we total the FATALITIES column by event (EVTYPE). We then set the column names to event and fatalaties respectively. Finally, we tack on the last column, injuries, also totalled by event.
# Health Impact - Data Processing
health <- aggregate(c(stormData$FATALITIES), by = list(stormData$EVTYPE), "sum")
colnames(health) <- c("event", "fatalaties")
health <- cbind(health, injuries = aggregate(c(stormData$INJURIES), by = list(stormData$EVTYPE),
"sum")$x)
str(health)
## 'data.frame': 985 obs. of 3 variables:
## $ event : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ fatalaties: num 0 0 0 0 0 0 0 0 0 0 ...
## $ injuries : num 0 0 0 0 0 0 0 0 0 0 ...
head(health)
## event fatalaties injuries
## 1 HIGH SURF ADVISORY 0 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD 0 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 0 0
## 6 TSTM WIND (G45) 0 0
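The same per-event totals could also be produced in one call with the formula interface of `aggregate()`. The following is only a sketch on a made-up `toy` data frame that borrows the column names of the storm data; it is not the code used for the results in this report.

```r
# Toy data with the same column names as the storm data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HAIL"),
  FATALITIES = c(3, 1, 2, 0),
  INJURIES   = c(10, 4, 5, 0)
)

# Sum both columns per event type in a single aggregate() call
toyHealth <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
toyHealth
```

One difference worth keeping in mind: the formula interface drops rows with missing values by default, whereas the two-step approach above passes every row to `sum`.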
This is a large list (985 observations), which we narrow down by dropping every row that is zero in both columns. Row names are not helpful at the moment, so they are removed.
# drop values that are just zero
health <- health[health$fatalaties > 0 | health$injuries > 0, ]
rownames(health) <- NULL
head(health)
## event fatalaties injuries
## 1 AVALANCE 1 0
## 2 AVALANCHE 224 170
## 3 BLACK ICE 1 24
## 4 BLIZZARD 101 805
## 5 blowing snow 1 1
## 6 BLOWING SNOW 1 13
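The listing above shows the same event under differently cased labels (blowing snow and BLOWING SNOW), and the factor levels shown earlier include a label with leading whitespace. This report does not merge such variants. If one wanted to, a minimal sketch on made-up data might look like the following; the `toy` and `merged` names are hypothetical.

```r
# Toy data mimicking the label variants seen above (values are made up)
toy <- data.frame(
  EVTYPE     = c("blowing snow", "BLOWING SNOW", " HIGH SURF ADVISORY"),
  FATALITIES = c(1, 1, 0)
)

# Trim surrounding whitespace and upper-case the labels before grouping
toy$EVTYPE <- toupper(trimws(as.character(toy$EVTYPE)))

# Variants that differed only by case or padding now share one group
merged <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
merged
```

This only handles case and padding; spelling variants such as AVALANCE versus AVALANCHE would still need a manual mapping.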
The next step is to extract just the most important events. The simplest way to do this is to sort by fatalaties, then injuries, in decreasing order and keep the top six rows (the default number returned by head).
health <- head(health[order(health$fatalaties, health$injuries, decreasing = TRUE), ])
health
## event fatalaties injuries
## 184 TORNADO 5633 91346
## 32 EXCESSIVE HEAT 1903 6525
## 42 FLASH FLOOD 978 1777
## 69 HEAT 937 2100
## 123 LIGHTNING 816 5230
## 191 TSTM WIND 504 6957
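To make the sorting step concrete, the sketch below applies the same `order()` call to a made-up four-row data frame. It shows that the second key (injuries) only matters when two events tie on the first key (fatalaties). The `toy` and `ranked` names and all values are hypothetical.

```r
# Toy data: events A and C tie on fatalaties (values are made up)
toy <- data.frame(
  event      = c("A", "B", "C", "D"),
  fatalaties = c(5, 9, 5, 1),
  injuries   = c(20, 3, 40, 7)
)

# Sort by fatalaties first; injuries is used only to break ties
ranked <- toy[order(toy$fatalaties, toy$injuries, decreasing = TRUE), ]

# Keep an explicit number of rows instead of relying on head()'s default of 6
head(ranked, n = 2)
```

Note that this ranking is driven by fatalities, so an event with many injuries but few deaths can fall outside the top rows.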
The plotting function requires a matrix as its input, so we create a separate matrix variable from the health data for plotting and set the event column as its row names.
plotHealth <- as.matrix(health[, c("fatalaties", "injuries")])
rownames(plotHealth) <- health$event
To plot our new matrix we need to transpose it, which we do as part of the plotting call. The transposed matrix looks like this:
t(plotHealth)
## TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT LIGHTNING TSTM WIND
## fatalaties 5633 1903 978 937 816 504
## injuries 91346 6525 1777 2100 5230 6957
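The reason for the transpose is that `barplot()`, when given a matrix, draws one bar per column and stacks the rows within each bar. The sketch below uses the first two events from the table above to show the shapes involved; the `twoEvents` and `m` names are introduced only for this illustration.

```r
# First two rows of the health table shown above
twoEvents <- data.frame(
  event      = c("TORNADO", "EXCESSIVE HEAT"),
  fatalaties = c(5633, 1903),
  injuries   = c(91346, 6525)
)

m <- as.matrix(twoEvents[, c("fatalaties", "injuries")])
rownames(m) <- as.character(twoEvents$event)

dim(m)     # events in rows, measures in columns
dim(t(m))  # measures in rows, events in columns

# Each bar's total height is the column sum of the transposed matrix
colSums(t(m))
```

Without the transpose, the plot would show one bar for fatalaties and one for injuries, each stacked by event, which is not the comparison we are after.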
A stacked bar chart reveals the relative differences concerning those affected given the top events.
par(oma = c(4, 1, 0, 1))
barplot(height = t(plotHealth), width = 1, col = 1:2, legend.text = c("fatalaties",
"injuries"), main = "Events Most Harmful with Respect to Population Health",
ylab = "Affected People", las = 3)
The black portion of each bar represents the fatalities for each event, whereas the red portion represents injuries. The graph is meant to show the relative size of the health impact, whereas the table above gives absolute values if needed.
Nothing comes close to tornadoes in terms of public health impact: they account for 5,633 fatalities and 91,346 injuries in our dataset. Though thunderstorm wind (TSTM WIND) causes more injuries than excessive heat (6,957 vs 6,525), excessive heat causes far more deaths (1,903 vs 504). The worst six weather-related events are (in order of fatalities): Tornado, Excessive Heat, Flash Flood, Heat, Lightning and TSTM Wind.
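To put the tornado figures in proportion, the short calculation below computes each event's share of the totals across these six events only (not across the whole database). The numbers are copied from the table above; the variable names are introduced here for illustration.

```r
events <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD",
            "HEAT", "LIGHTNING", "TSTM WIND")

# Totals copied from the top-six table above
fatalaties <- setNames(c(5633, 1903, 978, 937, 816, 504), events)
injuries   <- setNames(c(91346, 6525, 1777, 2100, 5230, 6957), events)

# Percentage share of each event within the top six
fatalShare  <- round(100 * fatalaties / sum(fatalaties), 1)
injuryShare <- round(100 * injuries / sum(injuries), 1)

fatalShare
injuryShare
```

Among these six events, tornadoes account for roughly half of the fatalities and about four fifths of the injuries.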
Question 2: Across the United States, which types of events have the greatest economic consequences?
To answer this question we have to review the data in some detail. There are many errors in the economic data, which can be quite misleading. Click Here for Details. Since we are looking only for the greatest economic consequences, we decided to set aside the smaller details of economic impact and concentrate on identifying the most prevalent events.
Reviewing the data reveals that only a small number of events record property damage in the billions.
length(stormData$PROPDMGEXP[stormData$PROPDMGEXP == "B"])
## [1] 40
Since it would take 1,000 entries of $1 million to match a single $1 billion entry, we should be able to set aside all other data in view of these very large impacts. Listing the unique event types for this subset confirms our suspicions.
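The scale argument can be sketched in code. The example below assumes that PROPDMGEXP is a magnitude code applied to a numeric damage column named PROPDMG, with K meaning thousands, M millions and B billions; only the B code is used in this report, so the other two codes, the PROPDMG column and the `multiplier` lookup are assumptions made for illustration, and all values are made up.

```r
# Toy rows with an assumed numeric damage column and magnitude code
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "HURRICANE"),
  PROPDMG    = c(2.5, 1, 1),
  PROPDMGEXP = c("K", "M", "B")
)

# Assumed multipliers: K = thousands, M = millions, B = billions
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up the multiplier by code; codes missing from the lookup become NA
toy$dollars <- toy$PROPDMG * unname(multiplier[as.character(toy$PROPDMGEXP)])
toy

# Number of $1 million entries needed to equal one $1 billion entry
multiplier[["B"]] / multiplier[["M"]]
```

Given the data errors mentioned above, any code that is not in the lookup would yield NA here rather than a silently wrong dollar amount.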
unique(stormData$EVTYPE[stormData$PROPDMGEXP == "B"])
## [1] WINTER STORM HURRICANE OPAL/HIGH WINDS
## [3] HURRICANE OPAL TORNADOES, TSTM WIND, HAIL
## [5] RIVER FLOOD HEAVY RAIN/SEVERE WEATHER
## [7] SEVERE THUNDERSTORM FLOOD
## [9] HURRICANE WILD/FOREST FIRE
## [11] TROPICAL STORM FLASH FLOOD
## [13] WILDFIRE HURRICANE/TYPHOON
## [15] HIGH WIND STORM SURGE
## [17] STORM SURGE/TIDE HAIL
## [19] TORNADO
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
These are the types of events one would associate with the majority of economic impact from weather-related events. Continuing in this vein, we create an economic dataset that counts the events with damage in the billions.
economicImpact <- table(stormData$EVTYPE[stormData$PROPDMGEXP == "B"])
sort(economicImpact[economicImpact > 1], decreasing = TRUE)
##
## HURRICANE/TYPHOON FLOOD HURRICANE TORNADO
## 12 5 3 3
## HURRICANE OPAL STORM SURGE
## 2 2
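One detail of this step is easy to miss: because EVTYPE is a factor, `table()` on a subset still reports every level, most of them with a count of zero. The `> 1` filter therefore does double duty, removing both the unused levels and the event types that occur only once. A sketch on made-up data, with hypothetical names `ev`, `magnitude` and `counts`:

```r
# Toy factor of event types and a parallel vector of magnitude codes
ev        <- factor(c("FLOOD", "HAIL", "FLOOD", "TORNADO", "WIND"))
magnitude <- c("B", "K", "B", "B", "M")

# Subsetting a factor keeps all of its levels
counts <- table(ev[magnitude == "B"])
counts  # HAIL and WIND are listed with a count of 0

# Keep only event types with more than one billion-dollar entry
sort(counts[counts > 1], decreasing = TRUE)
```

Note that this counts billion-dollar entries per event type; it does not add up the dollar amounts.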
Plotting this reveals those events that have the largest economic impact.
plotData <- economicImpact[economicImpact > 1]
barplot(rev(sort(plotData)), legend.text = rownames(rev(sort(plotData))), col = 1:6,
axisnames = FALSE, ylab = "Number of Billion Dollar Events", main = "Events with the Greatest Economic Consequences")
Each bar is represented by a color which corresponds to the event type of the same color in the legend. The number of billion dollar events is listed on the y-axis in an effort to display the relative size of each event.
Events at sea plainly have the largest impact on property with Hurricanes (and Typhoons) leading the way. Flooding is next, followed by Tornadoes.