This analysis considers the National Weather Service's storm data from 1989 to November 2011. The purpose is to determine the types of events that have the greatest impact on population health and economic losses. From 1989 to 2011, tornadoes were responsible for the most injuries, by far, of any storm event type. However, excessive heat caused more deaths during this same time period. In addition, out of over $451 billion in total damage, 89% was to property and 11% was to crops. Of note, drought causes mostly crop damage while storm surge causes mostly property damage, and tornadoes are not responsible for the most damage - flooding causes the most economic damage overall.
Load necessary R packages.
#install.packages("knitr")
library(knitr)
#install.packages("stats")
library(stats)
#install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
#install.packages("reshape")
library(reshape)
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
##
## rename
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the data.
storm <- read.csv("repdata-data-StormData.csv.bz2", header=TRUE)
Our goal is to identify events that are most harmful to population health. To answer this question, we will consider “harmful” to be counts of injuries and fatalities resulting from each event.
We also need to identify types of events with the greatest economic consequences. This will be answered with the dollar amount values in the property damage and crop damage variables.
Data for earlier years is sparse, so we must determine how many records are available for each year.
First we must convert the field BGN_DATE from a factor variable to a date variable and determine how to sub-set the data by year.
storm$begin_date <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y")
storm$begin_date_year <- as.numeric(format(storm$begin_date, "%Y"))
table(storm$begin_date_year)
##
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961
## 223 269 272 492 609 1413 1703 2184 2213 1813 1945 2246
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## 2389 1968 2348 2855 2388 2688 3312 2926 3215 3471 2168 4463
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
## 5386 4975 3768 3728 3657 4279 6146 4517 7132 8322 7335 7979
## 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## 8726 7367 7257 10410 10946 12522 13534 12607 20631 27970 32270 28680
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817
## 2010 2011
## 48161 62174
Looking at a summary table of the number of records for each year, it appears that 1989 sees a significant jump in the number of events. Therefore, we will consider only records with events that began in 1989 or later.
storm_1989 <- subset(storm, begin_date_year >= 1989)
We now have all weather events for 1989 to 2011. However, many of the event types have typos or extraneous spaces. We will remove leading and trailing spaces and ensure all text is in uppercase, and create a new field for event type (preserving the original event type in case it is needed).
storm_1989$event_type <- gsub("^\\s+|\\s+$", "",toupper(storm_1989$EVTYPE))
We will summarize injuries and fatalities by event type, sort in descending order, and view the top events for both fatalities and injuries.
inj_fat <- data.frame(aggregate(cbind(FATALITIES, INJURIES) ~ event_type, data = storm_1989, sum))
fatalities <- inj_fat[order(-inj_fat$FATALITIES), ]
injuries <- inj_fat[order(-inj_fat$INJURIES), ]
head(fatalities)
## event_type FATALITIES INJURIES
## 108 EXCESSIVE HEAT 1903 6525
## 750 TORNADO 1802 27944
## 130 FLASH FLOOD 978 1777
## 235 HEAT 937 2100
## 410 LIGHTNING 816 5230
## 146 FLOOD 470 6789
head(injuries)
## event_type FATALITIES INJURIES
## 750 TORNADO 1802 27944
## 146 FLOOD 470 6789
## 108 EXCESSIVE HEAT 1903 6525
## 771 TSTM WIND 356 5404
## 410 LIGHTNING 816 5230
## 235 HEAT 937 2100
As seen by the sorted tables, tornadoes are responsible for far more injuries than any other event type, but excessive heat is responsible for more fatalities. These differences are worth noting, as shown in the plot of the top five event types by fatalities.
To further understand the data, we will analyze the top five event types by fatalities and view the number of injuries associated with them as well. To do this, we take the top event types with 470 or more fatalities, melt the table, and create a clustered bar chart.
fatalities_top <- subset(fatalities, FATALITIES >= 470)
injuries_top <- subset(injuries, INJURIES >= 2100)
fatalities_melt <- melt(fatalities_top, id.vars="event_type")
png("figure/fatalities-injuries.png", width = 650, height = 480)
ggplot(fatalities_melt, aes(factor(event_type), value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
scale_fill_brewer(palette="Set1") +
ggtitle("Fatalities and Injuries by Top Five Events") +
xlab("Event Type") +
ylab("Total Fatalities or Injuries")
dev.off()
## quartz_off_screen
## 2
We can therefore conclude that tornados cause the most injuries by far, and excessive heat causes the most fatalities, closely followed by tornadoes.
To analyze property damage (PROPDMG) and crop damage (CROPDMG) figures, we must note that there are additional fields indicating the units associated with these figures (PROPDMGEXP and CROPDMGEXP respectively). If the value is K, the dollar amount is in thousands, if the value is M, the dollar amount is in millions, etc. While this is not necessarily sound database design, we will account for it by creating two new variables that will be the units by which we multiply the dollar amounts.
storm_1989 <- mutate(storm_1989, propdmgunit=0, cropdmgunit=0)
prop_hundreds <- grep("h|H", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_hundreds] = 100
prop_thousands <- grep("k|K", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_thousands] = 1000
prop_millions <- grep("m|M", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_millions] = 1000000
prop_billions <- grep("b|B", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_billions] = 1000000000
crop_thousands <- grep("k|K", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_thousands] = 1000
crop_millions <- grep("m|M", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_millions] = 1000000
crop_billions <- grep("b|B", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_billions] = 1000000000
Now that we have the proper units, we can multiply the property and crop damage values by these units and aggregate by event type. The top 10 event types are determined and ordered by total damage.
storm_1989$property_damage <- storm_1989$PROPDMG * storm_1989$propdmgunit
storm_1989$crop_damage <- storm_1989$CROPDMG * storm_1989$cropdmgunit
storm_1989$total_damage <- storm_1989$property_damage + storm_1989$crop_damage
damage <- data.frame(aggregate(cbind(property_damage, crop_damage, total_damage) ~ event_type, data = storm_1989, sum))
head(damage[order(-damage$total_damage), ], 10)
## event_type property_damage crop_damage total_damage
## 146 FLOOD 144657709800 5661968450 150319678250
## 364 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 591 STORM SURGE 43323536000 5000 43323541000
## 750 TORNADO 32218566360 414953110 32633519470
## 204 HAIL 15732267220 3025954450 18758221670
## 130 FLASH FLOOD 16140861510 1421317100 17562178610
## 76 DROUGHT 1046106000 13972566000 15018672000
## 355 HURRICANE 11868319010 2741910000 14610229010
## 521 RIVER FLOOD 5118945500 5029459000 10148404500
## 379 ICE STORM 3944927810 5022113500 8967041310
Now we will create a stacked bar chart to visualize property and crop damage for the top 10 event types.
damage_top <- subset(damage, total_damage >= 8967041310)
damage_melt <- melt(damage_top, id.vars="event_type")
damage_melt_sub <- subset(damage_melt, variable != "total_damage")
png("figure/property-crop-damage.png", width = 480, height = 480)
ggplot(damage_melt_sub, aes(factor(event_type), value, fill=variable)) +
geom_bar(stat="identity", position="stack") +
scale_fill_brewer(palette="Set2") +
ggtitle("Property and Crop Damage by Top 10 Event Types") +
xlab("Event Type") +
ylab("Total Damage in $USD") +
coord_flip()
dev.off()
## quartz_off_screen
## 2
As seen in the plot, floods, hurricanes/typhoons, storm surge, and tornadoes account for the greatest economic losses. Also of note, property damage accounts for the vast majority of damage for most events, except for drought (which makes sense, since crops would be most affected by that). Storm surge results in very little crop damage, which also make sense, since it's usually cities by coasts, rather than farmland. Also note that while tornadoes account for many more injuries and deaths than any other event type, they do not account for the most economic damage.
The differences between property and crop damage is vast, as seen in this staked bar chart of damage by year.
total_property_damage <- sum(storm_1989$property_damage)
total_crop_damage <- sum(storm_1989$crop_damage)
total_damage <- total_property_damage + total_crop_damage
percent_prop_damage <- round(100*(total_property_damage/total_damage), 0)
percent_crop_damage <- 100-percent_prop_damage
damage_by_year <- data.frame(aggregate(cbind(property_damage, crop_damage) ~ begin_date_year, data = storm_1989, sum))
damage_by_year_melt <- melt(damage_by_year, id.vars="begin_date_year")
png("figure/damage-by-year.png", width = 800, height = 480)
ggplot(damage_by_year_melt, aes(factor(begin_date_year), value, fill=variable)) +
geom_bar(stat="identity", position="stack") +
scale_fill_brewer(palette="Set2") +
ggtitle("Property and Crop Damage by Year, 1989-2011") +
xlab("Event Type") +
ylab("Total Damage in $USD")
dev.off()
## quartz_off_screen
## 2
In total, there was 4.0260005 × 1011 in property damage and 4.9104192 × 1010 in crop damage. It is also interesting to note that of the total 4.5170424 × 1011 in damage, a whopping 89 percent of of damage is property damage, while only 11 percent is crop damage.
As seen in the chart, there was no crop damage until 1993, so that data likely wasn't collected until that year. Clearly visible is the spike in 2005 and 2006, almost certainly due to Hurricane Katrina. However, since that event and subsequent flooding occurred only in 2005, and 2006 sees even more damage, it would be a safe assumption that the quality of this data (specifically begin date) is questionable.