This analysis utilizes data from the National Climatic Data Center of the United States Department of Commerce to address the broad public health and economic impacts of severe weather events. Data on severe weather events has been collected by the National Weather Service between 1950 and 2011.
This report represents my submission for Peer Assessment #2 for “Reproducible Research”, Course #5 of 9 in the Johns Hopkins University Data Science track on Coursera. It represents my work alone, and is intended for use in this course only.
This analysis relies on the following packages. The embedded code includes the library calls for each package, but it assumes these packages have already been installed.
The data was downloaded from the Course assignment page: https://class.coursera.org/repdata-034/human_grading/view/courses/975147/assessments/4/submissions
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(xtable)
library(lubridate)
library(ggplot2)
#download.file(
# "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
# "StormData.csv.bz2", method="curl")
stormdata <- read.csv("StormData.csv.bz2")
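The download step above is commented out, so the file has to be in the working directory already. One possible way to make that step repeatable is to download only when the file is missing. This is a minimal sketch, not part of the original analysis; `fetch_if_missing` and its `downloader` argument are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not part of the original report): download a file
# only when it is not already present at the destination path.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on Windows
    downloader(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (commented out so that the sketch does not touch the network):
# fetch_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "StormData.csv.bz2")
# stormdata <- read.csv("StormData.csv.bz2")
```

The `downloader` argument exists only so the helper can be exercised without a network connection.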
The two questions this report seeks to answer both have to do with maximum impacts of different types of weather events:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This analysis considers the two variables that represent health effects (INJURIES and FATALITIES) and the two sets of variables that represent economic effects (PROPDMG and CROPDMG). The health variables are fairly straightforward, but the economic variables require some additional pre-processing. There is a separate exponent variable for each (PROPDMGEXP and CROPDMGEXP), but they do not have a consistent structure.
Sometimes the exponent is a number and sometimes it is a letter, for example “M” or “m” for millions. This chunk of code converts all of the exponents to a common format and then multiplies through to get new, single numeric values for PROPCOST and CROPCOST.
# Clean up exponent for property damage and calculate PROPCOST
propexp <- tolower(stormdata$PROPDMGEXP)
propexp[propexp == "b"] <- 9 #billions = 10^9
propexp[propexp == "m"] <- 6 #millions = 10^6
propexp[propexp == "k"] <- 3 #thousands = 10^3
propexp[propexp == "h"] <- 2 #hundreds = 10^2
propexp <- as.numeric(propexp) #coerce non-numeric codes to NA
## Warning: NAs introduced by coercion
stormdata$PROPCOST <- stormdata$PROPDMG * 10^(propexp)
stormdata$PROPCOST[is.na(stormdata$PROPCOST)] <- 0
# Clean up exponent for crop damage and calculate CROPCOST
cropexp <- tolower(stormdata$CROPDMGEXP)
cropexp[cropexp == "b"] <- 9 #billions = 10^9
cropexp[cropexp == "m"] <- 6 #millions = 10^6
cropexp[cropexp == "k"] <- 3 #thousands = 10^3
cropexp[cropexp == "h"] <- 2 #hundreds = 10^2
cropexp <- as.numeric(cropexp) #coerce non-numeric codes to NA
## Warning: NAs introduced by coercion
stormdata$CROPCOST <- stormdata$CROPDMG * 10^(cropexp)
stormdata$CROPCOST[is.na(stormdata$CROPCOST)] <- 0
# Create total cost variable as sum of PROPCOST and CROPCOST
stormdata$TOTALCOST <- stormdata$PROPCOST + stormdata$CROPCOST
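The exponent clean-up above can also be written as a small lookup function that is shared by both damage variables. The sketch below is one possible way to do that, not the code used to produce the results in this report; `exp_to_power` and the toy data frame are hypothetical and exist only for illustration.

```r
# Hypothetical helper: map an exponent code to a power of ten.
# Letters follow the convention described above, digits are used as-is,
# and anything else (blank, "+", "?", ...) becomes NA.
exp_to_power <- function(code) {
  code <- tolower(as.character(code))
  letters_map <- c(h = 2, k = 3, m = 6, b = 9)
  power <- suppressWarnings(as.numeric(code))  # digits -> numbers, rest -> NA
  is_letter <- code %in% names(letters_map)
  power[is_letter] <- letters_map[code[is_letter]]
  power
}

# Toy data, made up for illustration only
toy <- data.frame(DMG = c(25, 2.5, 10, 5, 7),
                  EXP = c("K", "m", "B", "3", "?"))
toy$COST <- toy$DMG * 10^exp_to_power(toy$EXP)
toy$COST[is.na(toy$COST)] <- 0  # unparseable exponents count as zero cost
toy
```

Using a single function for both PROPDMGEXP and CROPDMGEXP avoids maintaining two copies of the same conversion logic.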
#Number of events by outcome
eventsTotal <- nrow(stormdata)
eventsInjury <- sum(stormdata$INJURIES > 0)
eventsFatal <- sum(stormdata$FATALITIES > 0)
eventsProperty <- sum(stormdata$PROPCOST > 0)
eventsCrops <- sum(stormdata$CROPCOST > 0)
Out of 902297 events recorded, there were:
Weather is hard to categorize, often because single events are made up of multiple parts. Many different events can happen in the same span of time which can make them difficult to separate out. For example, thunderstorm events often include wind, rain, hail, and sometimes tornados. Winter events can include snow, sleet, freezing rain, ice, or all of the above - not to mention wind and cold temperatures. Once in a while you’ll even see all of the above - THUNDERSNOW!
The event type within the EVTYPE variable in the original dataset has many unique values, many of which overlap. Instead of just one value for ‘HURRICANE’, there are also values of ‘HURRICANE EMILY’ and ‘HURRICANE FELIX’. The term ‘THUNDERSTORM’ appears in many different values of EVTYPE, and is sometimes even abbreviated as ‘TSTM’ instead.
This analysis groups the various iterations of different types of weather events into a manageable number of discrete categories. These are stored in a new variable called EVCAT, for EVent CATegory. I’ve used the ‘grep’ function to search for characteristic text strings in the EVTYPE variable that correspond to each category. For example, the category ‘tropical’ includes events like hurricanes, typhoons, and tropical storms. It is made up of events that contain the strings ‘hurricane’, ‘tropical’, and ‘typhoon’. The category ‘tstorm’ is made up of the descriptive strings like ‘thunder’ and ‘tstm’, but also components of thunderstorms like ‘lightning’ and ‘hail’.
Wind and rain are part of thunderstorms, but you can also get wind and rain outside of a thunderstorm. So I’ve created a hierarchy of events, and assigned the value of EVCAT in the order of that hierarchy, with lower-order events like rain first and higher-order events like thunderstorms later. That way, if an event has two or more key text strings in its EVTYPE description, it will be assigned the lower-order EVCAT value first and then reassigned the higher-order EVCAT value later. In other words, an EVTYPE of ‘tstm and wind’ will first get the EVCAT value ‘wind’ but end up with the EVCAT value ‘tstorm’. Starting off with an EVCAT value of ‘other’ for all of the entries ensures that any event that does not have a matching text string will end up with an EVCAT value of ‘other’.
This is an approximation, but reducing the number of weather categories from 9xx to 16 makes for a more easily understandable analysis. In other words, it’s pretty good for a first cut, and could be refined more later if the original analysis led to more detailed questions to be answered (and enough time and resources were allocated to do that more detailed analysis).
## Categorize events
stormdata$EVTYPE <- tolower(stormdata$EVTYPE)
stormdata$EVCAT <- "other"
#Basic Singular Events
stormdata$EVCAT[grep("fog", stormdata$EVTYPE)] <- "fog"
stormdata$EVCAT[grep("hot|heat|warm", stormdata$EVTYPE)] <- "heat"
stormdata$EVCAT[grep("cold|cool|frost|freeze", stormdata$EVTYPE)] <- "cold"
stormdata$EVCAT[grep("surf|swell|rip|tide", stormdata$EVTYPE)] <- "surf"
stormdata$EVCAT[grep("drought|dry", stormdata$EVTYPE)] <- "drought"
#Precipitation
stormdata$EVCAT[grep("rain", stormdata$EVTYPE)] <- "rain"
stormdata$EVCAT[grep("wind|microburst", stormdata$EVTYPE)] <- "wind"
stormdata$EVCAT[grep("snow|sleet|freezing|ice|icy|blizzard|winter|mix", stormdata$EVTYPE)] <- "snow"
stormdata$EVCAT[grep("flood|fld", stormdata$EVTYPE)] <- "flood"
#Storms
stormdata$EVCAT[grep("thunder|tstm|lightning|hail", stormdata$EVTYPE)] <- "tstorm"
stormdata$EVCAT[grep("tornado|waterspout|funnel", stormdata$EVTYPE)] <- "tornado"
stormdata$EVCAT[grep("tropical|hurricane|typhoon", stormdata$EVTYPE)] <- "tropical"
stormdata$EVCAT[grep("surge", stormdata$EVTYPE)] <- "surge"
#Catastrophes
stormdata$EVCAT[grep("fire", stormdata$EVTYPE)] <- "fire"
stormdata$EVCAT[grep("tsunami", stormdata$EVTYPE)] <- "tsunami"
stormdata$EVCAT[grep("volcanic", stormdata$EVTYPE)] <- "volcano"
stormdata$EVCAT[grep("avalanche|slide", stormdata$EVTYPE)] <- "landslide"
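To make the effect of the assignment order concrete, the sketch below applies a small subset of the patterns above to a few toy event types. It is an illustration only: `categorize`, `rules`, and `toy_types` are hypothetical names, and the rule list is deliberately shorter than the one used in the report.

```r
# Ordered (category = pattern) rules; later rules overwrite earlier ones.
# Only a subset of the report's patterns is reproduced here.
rules <- c(rain   = "rain",
           wind   = "wind|microburst",
           tstorm = "thunder|tstm|lightning|hail")

categorize <- function(evtype, rules) {
  evtype <- tolower(evtype)
  evcat <- rep("other", length(evtype))  # default for unmatched events
  for (category in names(rules)) {
    evcat[grepl(rules[[category]], evtype)] <- category
  }
  evcat
}

# Toy event types, chosen for illustration
toy_types <- c("TSTM WIND", "HIGH WIND", "Heavy Rain", "DUST DEVIL")
categorize(toy_types, rules)
```

Here "TSTM WIND" is first labelled "wind" and then overwritten with "tstorm", while "DUST DEVIL" matches no pattern and keeps the default "other".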
The data is collected over a long period of time, from 1950 through 2011. A lot has changed over that time, including the way data is collected, particularly the amount and level of detail of data that is collected. So we’ll take a look at the annual totals for impacts to health and economic costs for each of the categories we’ve created.
stormdata$YEAR <- year(mdy_hms(stormdata$BGN_DATE))
annualSum <- aggregate(cbind(INJURIES, FATALITIES, TOTALCOST) ~ EVCAT + YEAR, sum,
data=stormdata)
annualAve <- aggregate(cbind(INJURIES, FATALITIES, TOTALCOST) ~ EVCAT + YEAR, mean,
data=stormdata)
injPlot <- ggplot(annualSum, aes(YEAR, INJURIES, color=EVCAT))
print(injPlot + geom_line())
injTable <- select(annualSum[annualSum$INJURIES > 5000,], EVCAT, YEAR, INJURIES)
kable(injTable, align=c('c','c','c','c'))
| | EVCAT | YEAR | INJURIES |
|---|---|---|---|
| 4 | tornado | 1953 | 5131 |
| 26 | tornado | 1965 | 5197 |
| 44 | tornado | 1974 | 6824 |
| 161 | flood | 1998 | 6445 |
| 376 | tornado | 2011 | 6163 |
This plot shows the total number of injuries per year for each of the weather categories we created earlier. It’s clear tornados have had some of the highest health costs over the entire span of this data, with a number of years with over 2,000 injuries from that category of weather. There were tornado events in 1953, 1965, 1974, and 2011 that injured more than 5,000 people. You can also see a large flood event in 1998 where over 6,000 people were injured. It is also interesting to note that injuries from tornados seem to have reduced since the mid-1980s, suggesting our ability to identify and warn people to take cover has improved markedly since then.
fatPlot <- ggplot(annualSum, aes(YEAR, FATALITIES, color=EVCAT))
print(fatPlot + geom_line())
fatTable <- select(annualSum[annualSum$FATALITIES > 200,], EVCAT, YEAR, FATALITIES)
kable(fatTable, align=c('c','c','c','c'))
| | EVCAT | YEAR | FATALITIES |
|---|---|---|---|
| 3 | tornado | 1952 | 230 |
| 4 | tornado | 1953 | 519 |
| 26 | tornado | 1965 | 301 |
| 44 | tornado | 1974 | 366 |
| 116 | heat | 1995 | 1056 |
| 177 | heat | 1999 | 502 |
| 288 | heat | 2006 | 252 |
| 376 | tornado | 2011 | 587 |
This plot shows the total number of fatalities per year for each of the weather categories we created earlier. As with injuries, many of the events with the highest number of fatalities appear to be tornados. The same four outbreaks in 1953, 1965, 1974, and 2011 killed more than 250 people each. Heat is the other biggest killer, with three heat wave events in 1995, 1999, and 2006 that killed more than 250 people. In fact, the worst single event for human fatalities was the heat wave of 1995 that killed 1,056 people.
dmgPlot <- ggplot(annualSum, aes(YEAR, TOTALCOST, color=EVCAT))
print(dmgPlot + geom_line())
dmgTable <- select(annualSum[annualSum$TOTALCOST > 10000000000,], EVCAT, YEAR, TOTALCOST)
kable(dmgTable, align=c('c','c','c','c'))
| | EVCAT | YEAR | TOTALCOST |
|---|---|---|---|
| 85 | flood | 1993 | 17117336100 |
| 123 | tropical | 1995 | 18717932000 |
| 161 | flood | 1998 | 13777705050 |
| 170 | tropical | 1998 | 304767042000 |
| 185 | tropical | 1999 | 505101461000 |
| 264 | tropical | 2004 | 448295290000 |
| 278 | surge | 2005 | 43058565000 |
| 280 | tropical | 2005 | 352975473330 |
| 287 | flood | 2006 | 156779168070 |
| 304 | flood | 2007 | 10746512460 |
| 320 | flood | 2008 | 16515225310 |
| 368 | flood | 2011 | 17309877150 |
| 376 | tornado | 2011 | 12289675800 |
This plot shows the total amount of damage in dollars per year for each of the weather categories we created earlier. The total damage shown here includes damage to both property and crops. Far and away the weather events that produce the highest economic costs are tropical storms and hurricanes. The tropical seasons of 1998, 1999, 2004, and 2005 all resulted in over $300 billion in damage each, with the worst year of 1999 costing the US over half a trillion dollars in damage to property and crops. Floods are the other big source of economic damage, with six years of damages topping $10 billion each. The tornado outbreak of 2011, which killed nearly 600 people and injured over 6,000 more, also resulted in over $12 billion in total damages.
From the plots above, it’s clear that most categories appear in the data only after 1993. In fact, before that year the only two categories collected were ‘tornado’ and ‘tstorm’. And before 1955, the only category collected was ‘tornado’. So to compare total health and economic costs across categories, we’ll focus the analysis here on data from 1993 onward. If we included the years before that in our totals, the results would include a big skew towards ‘tornado’ and ‘tstorm’.
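One way to check when each category first shows up in the records is to take the minimum YEAR per EVCAT. The following is a minimal sketch on made-up data; the `toy` data frame and `firstYear` are illustrative names, and the years in it are invented rather than taken from the dataset.

```r
# Toy stand-in for stormdata, invented for illustration; only the two
# columns needed here are included.
toy <- data.frame(
  EVCAT = c("tornado", "tornado", "tstorm", "tstorm", "flood", "heat"),
  YEAR  = c(1950, 1993, 1955, 2001, 1993, 1994)
)

# Earliest year in which each category appears
firstYear <- aggregate(YEAR ~ EVCAT, data = toy, FUN = min)
firstYear[order(firstYear$YEAR), ]
```

Running the same `aggregate` call with `data = stormdata` would give the first recorded year for each of the real categories.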
This section aggregates cumulative totals and averages across the EVCAT variable, concentrating on the years 1993-2011, when all of the categories were being recorded.
recent <- stormdata[stormdata$YEAR >= 1993, ]
totals <- aggregate(cbind(INJURIES, FATALITIES, PROPCOST, CROPCOST) ~ EVCAT, sum,
data=recent)
averages <- aggregate(cbind(INJURIES, FATALITIES, PROPCOST, CROPCOST) ~ EVCAT, mean,
data=recent)
mostInjuries <- arrange(select(totals, EVCAT, INJURIES), desc(INJURIES))
mostFatalities <- arrange(select(totals, EVCAT, FATALITIES), desc(FATALITIES))
highestAveInjuries <- arrange(select(averages, EVCAT, INJURIES), desc(INJURIES))
highestAveFatalities <- arrange(select(averages, EVCAT, FATALITIES), desc(FATALITIES))
top10injuries <- arrange(top_n(stormdata, 10, INJURIES)[c("EVTYPE","INJURIES")],
desc(INJURIES))
top10fatalities <- arrange(top_n(stormdata, 10, FATALITIES)[c("EVTYPE","FATALITIES")],
desc(FATALITIES))
Which categories of events have caused the most injuries and the most fatalities since 1993?
| EVCAT | INJURIES | EVCAT | FATALITIES |
|---|---|---|---|
| tornado | 23403 | heat | 3143 |
| tstorm | 12420 | tornado | 1652 |
| heat | 9228 | flood | 1553 |
| flood | 8683 | tstorm | 1295 |
| snow | 6184 | surf | 736 |
Which categories have the highest average number of injuries and fatalities since 1993?
| EVCAT | INJURIES | EVCAT | FATALITIES |
|---|---|---|---|
| tsunami | 6.4500000 | tsunami | 1.6500000 |
| heat | 3.1196755 | heat | 1.0625423 |
| tropical | 1.6250000 | surf | 0.3463529 |
| tornado | 0.6364180 | landslide | 0.2579403 |
| other | 0.6140231 | tropical | 0.1903409 |
Which event types account for the ten individual events with the most injuries and the most fatalities overall?
| EVTYPE | INJURIES | EVTYPE | FATALITIES |
|---|---|---|---|
| tornado | 1700 | heat | 583 |
| ice storm | 1568 | tornado | 158 |
| tornado | 1228 | tornado | 116 |
| tornado | 1150 | tornado | 114 |
| tornado | 1150 | excessive heat | 99 |
| flood | 800 | tornado | 90 |
| tornado | 800 | tornado | 75 |
| tornado | 785 | excessive heat | 74 |
| hurricane/typhoon | 780 | excessive heat | 67 |
| flood | 750 | tornado | 57 |
Tornados and thunderstorms injured the most people between 1993 and 2011, while heat events were the biggest killer and a significant source of injuries as well. The table of averages suggests that while tsunami events do not happen all that often, when they do they take a great toll on population health in the form of injuries and fatalities. Tropical events also fall into this category of large events with high average health costs.
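The contrast between totals and averages is easier to read when the number of events is shown next to both. A minimal sketch of such a summary follows, using made-up numbers; `toy` and `summaryByCat` are hypothetical names and the values are not taken from the dataset.

```r
# Toy data, invented for illustration: a rare category with a high
# per-event toll next to a frequent category with a low one.
toy <- data.frame(
  EVCAT    = c("tsunami", "tsunami", "tstorm", "tstorm", "tstorm", "tstorm"),
  INJURIES = c(12, 0, 1, 0, 0, 3)
)

# Event count, total, and average injuries per category
byCat <- split(toy$INJURIES, toy$EVCAT)
summaryByCat <- data.frame(
  EVCAT   = names(byCat),
  events  = vapply(byCat, length, integer(1), USE.NAMES = FALSE),
  total   = vapply(byCat, sum, numeric(1), USE.NAMES = FALSE),
  average = vapply(byCat, mean, numeric(1), USE.NAMES = FALSE)
)
summaryByCat <- summaryByCat[order(-summaryByCat$average), ]
summaryByCat
```

A category with few events and a high average sorts to the top here even when its total is modest, which is the pattern the averages table points to for tsunamis.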
The last table here looks back at the EVTYPE variable from the original dataset to take a finer-grained look at the top ten events for injuries and fatalities since 1950. This needs to be taken with a grain of salt: as we determined earlier, tornados and thunderstorms are likely to be over-represented here because they were the only types of event being recorded from 1950 to 1992.
So the fact that six of the top ten events that caused the most injuries and the most fatalities since 1950 were tornado outbreaks is interesting, but it may be a bit misleading. Also, this table will tend to under-represent events that cross state lines, since each record in the original data table is separated out by location. For example, if a flood caused injuries in several states, the events in this table may only represent the injuries from one of those states.
mostPropDamage <- arrange(select(totals, EVCAT, PROPCOST), desc(PROPCOST))
mostCropDamage <- arrange(select(totals, EVCAT, CROPCOST), desc(CROPCOST))
highestAvePropDamage <- arrange(select(averages, EVCAT, PROPCOST), desc(PROPCOST))
highestAveCropDamage <- arrange(select(averages, EVCAT, CROPCOST), desc(CROPCOST))
top10propDamage <- arrange(top_n(stormdata, 10, PROPCOST)[c("EVTYPE","PROPCOST")],
desc(PROPCOST))
top10cropDamage <- arrange(top_n(stormdata, 10, CROPCOST)[c("EVTYPE","CROPCOST")],
desc(CROPCOST))
Which categories have caused the most property damage and most crop damage since 1993?
| EVCAT | PROPCOST | EVCAT | CROPCOST |
|---|---|---|---|
| flood | 168269944238 | tropical | 1.552691e+12 |
| tropical | 93072537560 | flood | 1.351411e+11 |
| surge | 47965224000 | tstorm | 3.269518e+10 |
| tstorm | 28098417355 | tornado | 3.076988e+10 |
| tornado | 28014883594 | wind | 9.260054e+09 |
Which categories have caused the highest average property costs and crop costs since 1993?
| EVCAT | PROPCOST | EVCAT | CROPCOST |
|---|---|---|---|
| surge | 116703708 | tropical | 1470351052 |
| tropical | 88136873 | fire | 1832552 |
| tsunami | 7203100 | flood | 1569328 |
| fire | 2005101 | drought | 1317350 |
| flood | 1954037 | tsunami | 1000000 |
Which event types account for the ten individual events with the most damage to property and to crops overall?
| EVTYPE | PROPCOST | EVTYPE | CROPCOST |
|---|---|---|---|
| flood | 1.150e+11 | hurricane | 5.00e+11 |
| storm surge | 3.130e+10 | hurricane | 3.01e+11 |
| hurricane/typhoon | 1.693e+10 | hurricane/typhoon | 3.00e+11 |
| storm surge | 1.126e+10 | hurricane/typhoon | 2.85e+11 |
| hurricane/typhoon | 1.000e+10 | hurricane/typhoon | 9.32e+10 |
| hurricane/typhoon | 7.350e+09 | flood | 3.25e+10 |
| hurricane/typhoon | 5.880e+09 | hurricane/typhoon | 2.50e+10 |
| hurricane/typhoon | 5.420e+09 | hurricane/typhoon | 2.50e+10 |
| tropical storm | 5.150e+09 | hurricane opal/high winds | 1.00e+10 |
| winter storm | 5.000e+09 | wildfire | 6.50e+09 |
Floods, tropical events, thunderstorms, and tornados are all in the top five categories for total damage to property and crops. In the years between 1993 and 2011, these four types of events caused the most economic damage by far. The table of averages suggests that fire events and tsunamis are relatively rare, but cause a large amount of economic costs when they occur. It is also interesting to note that ‘surge’ events appear in the list of highest average costs to property. These types of events tend to accompany tropical events such as hurricanes and tropical storms. It was tempting to group them together in the original categorization of events. Leaving them separate suggests an interesting difference between the two, however. While tropical events cause high amounts of health and economic costs, the impacts of surge events tend to be mostly economic.
The last table here again examines the individual events that had the highest economic impacts, labelled by the EVTYPE variable from the original dataset. This data reinforces the notion that tropical events like hurricanes and typhoons are the most damaging type to property and crops. Again, this data may under-represent some types of events that span multiple states. A more detailed assessment of the costliest single events would require additional aggregation of multiple records that occur over multiple locations in similar timeframes.
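As a rough illustration of what that additional aggregation might look like, the sketch below sums records that share an event type and a begin date, regardless of state. This rests on a strong simplifying assumption (same type and same start date means same event) that was not used or validated in this report; the `toy` data frame and its values are invented.

```r
# Toy records, invented for illustration: one flood reported separately
# in three states, plus an unrelated tornado on the same day.
toy <- data.frame(
  EVTYPE   = c("flood", "flood", "flood", "tornado"),
  BGN_DATE = as.Date(c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01")),
  STATE    = c("TX", "LA", "AR", "TX"),
  PROPCOST = c(4e8, 1e8, 5e7, 2e6)
)

# Assumption: records with the same EVTYPE and BGN_DATE belong to one event
combined <- aggregate(PROPCOST ~ EVTYPE + BGN_DATE, data = toy, FUN = sum)
combined[order(-combined$PROPCOST), ]
```

A real analysis would likely need a looser match on dates and some notion of geographic proximity, since large events rarely start on the same day everywhere.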
A public official looking for guidance on how to reduce the population health effects and economic costs of weather events in the United States could take a few ideas from this report, including areas for further future study.
In terms of impacts to population health, in the form of injuries and fatalities, it may be useful to focus on reducing overall impacts from tornados, heat waves, thunder storms, and floods. In particular, heat waves tend to kill and injure lots of people in relatively few events. This could be an area where targeted intervention could make a significant impact.
In terms of economic impacts, in the form of damage to property and crops, it may be useful to focus on protection from large tropical events like hurricanes, typhoons, and the resulting storm surges from such events. The fact that population health costs are relatively low for these events suggests that early warnings and evacuation orders have allowed people to escape the worst impacts of these events. But protecting property and crops, which can’t so easily be moved out of harm’s way, may require significant changes in other areas of policy.
The method of categorizing weather events into a manageable number is a good way to get an overall look at the data here. But it may be worth parsing certain categories further in future analysis to be able to provide more detailed policy advice. In addition, the analysis in this report didn’t join up events that happened concurrently across multiple states. A further analysis that does this may be more useful in determining how to plan for such large events in the future.