This analysis examines publicly available storm data, collected by the U.S. National Oceanic and Atmospheric Adminstration (NOAA) in their storm database. This data was collected from 1950 to 2011.
The database, as described on the NOAA website, has interesting points about the data:
These facts point us to the fact that there is some data cleanup needed in the event classification to get meaninful results. Because there is incomplete data prior to 1996, this analysis only covers from 1996 to 2011.
The purpose of this analysis is to reveal the major types of storm events and their impact on both the human population (deaths and injuries), and the economic impact (property and crop damage.)
The results of the analysis are shown in the Analysis and Results section.
First, we need to establish our work environment and download the file if we haven’t done so already. The downloaded csv file is called 2F2FStormData.csv.bz2.
setwd("~/GitHub/Storm-Data-Analysis")
UrlName = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
csvfile = "2F2FStormData.csv.bz2"
# don't download file if we have already
if (!file.exists(csvfile)) {
fileUrl = UrlName
download.file(fileUrl,destfile=csvfile)
dateDownloaded = date()
dateDownloaded
}
# Read the data and take a look
StormData = read.csv (csvfile)
dim(StormData)
## [1] 902297 37
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We only need a summary of this data, so we eliminate all but the 9 fields of interest for the analysis. These are the fields related to injuries, fatalities, and economic impact.
The NOA website at http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype informs us that prior to 1996, they only recorded tornado, thunderstorm, wind and hail events. To ensure consistent detail, we will use a subset of the data which includes only events from 1996 through 2011.
StormData <- StormData[, c("BGN_DATE", "EVTYPE", "FATALITIES","INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
dim(StormData)
## [1] 902297 8
summary(StormData$EVTYPE)
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
## STORM SURGE/TIDE RECORD WARMTH COASTAL FLOODING
## 148 146 143
## DUST DEVIL MARINE HIGH WIND UNSEASONABLY WARM
## 141 135 126
## FLOODING ASTRONOMICAL HIGH TIDE MODERATE SNOWFALL
## 120 103 101
## URBAN FLOODING WINTRY MIX HURRICANE/TYPHOON
## 98 90 88
## FUNNEL CLOUDS HEAVY SURF RECORD HEAT
## 87 84 81
## FREEZE HEAT WAVE COLD
## 74 74 72
## RECORD COLD ICE THUNDERSTORM WINDS HAIL
## 64 61 61
## TROPICAL DEPRESSION SLEET UNSEASONABLY DRY
## 60 59 56
## FROST GUSTY WINDS THUNDERSTORM WINDSS
## 53 53 51
## MARINE STRONG WIND OTHER SMALL HAIL
## 48 48 47
## FUNNEL FREEZING FOG THUNDERSTORM
## 46 45 45
## Temperature record TSTM WIND (G45) Coastal Flooding
## 43 39 38
## WATERSPOUTS MONTHLY PRECIPITATION WINDS
## 37 36 36
## (Other)
## 2940
UniqueEvents = unique(StormData$EVTYPE)
noEvents = length(UniqueEvents)
There are 985 distinct values for event type. This list must be consolidated into fewer groups.
# select only the fatalities and injuries
StormSubset = StormData[StormData$FATALITIES>0 |
StormData$INJURIES > 0 | StormData$PROPDMG > 0 |
StormData$CROPDMG > 0, ]
StormSubset$Dt = as.POSIXlt(StormSubset$BGN_DATE,format="%m/%d/%Y %H:%M:%S")
# select the data since 1996
Since1996=StormSubset[format(StormSubset$Dt,"%Y") >= "1996",]
Apply a little cleanup to the EVTYPE field.
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Since1996$EVTYPE = trim(Since1996$EVTYPE)
# convert all to upper
Since1996$EVTYPE = toupper(Since1996$EVTYPE)
From the above summary, the event type field is what we need to break the storm data into meaningful groups. However, the data is very dirty, e.g. TSTM WIND is most likely the same as THUNDERSTORM WIND. As a matter of fact, there are 985 unique values - we should try to analyze around 8 to 10 major storm groupings, consolidating some of the obvious duplicates.
Fortunately, a researcher, Coursea colleague, and R blogger named Mauricio Linhares has developed a substitiution table to eliminate the various misspellings and duplicate entries in the EVNTYPE field. His information about the process is at http://rpubs.com/jlutzwpi/56060. In the spirit of reproducible research, we choose to use his well-crafted substitution table to consolidate our events into a very workable number.
The translation file that normalizes the event data is called replacements.csv, which we merge with the event data.
url2 =
"https://raw.githubusercontent.com/mauricio/reproductible-research-assignment-2/master/replacements.csv"
if (!file.exists("replacements.csv")) {
download.file (url2,"replacements.csv")
}
repl = read.csv("replacements.csv",stringsAsFactors = FALSE)
names(repl) = c("x","EVTYPE","Event")
# Merge these over the EVTYPE field
Since1996Merged = merge(Since1996,repl,by="EVTYPE")
names(Since1996Merged) = c(names(Since1996Merged[1:10]),"Event")
dim(Since1996Merged)
## [1] 201318 11
# group the data by total fatalities and injuries
FatalandInjuries =aggregate(
x=Since1996Merged[,c("FATALITIES","INJURIES")],
by=list(Since1996Merged$Event),FUN="sum")
# create the ordered fatality and injury data
SortedFatal=
FatalandInjuries [order(-FatalandInjuries$FATALITIES),][1:10,]
SortedInjuries =
FatalandInjuries [order(-FatalandInjuries$INJURIES),][1:10,]
We also need to compute the cost of the events. First, some data cleanup must occur. The in both cases, dollar cost is codified by fields that indicate the magnitude of the cost: K-1000, M-millions and B-billions. A little math then calculates the cost of each.
Since1996Merged$PropFactor = 1
Since1996Merged$PropFactor [Since1996Merged$PROPDMGEXP=="K"] = 1000
Since1996Merged$PropFactor [Since1996Merged$PROPDMGEXP=="M"] = 1000000
Since1996Merged$PropFactor [Since1996Merged$PROPDMGEXP=="B"] = 1000000000
Since1996Merged$CropFactor = 1
Since1996Merged$CropFactor [Since1996Merged$CROPDMGEXP=="K"] = 1000
Since1996Merged$CropFactor [Since1996Merged$CROPDMGEXP=="M"] = 1000000
Since1996Merged$CropFactor [Since1996Merged$CROPDMGEXP=="B"] = 1000000000
Since1996Merged$PropertyExpense = Since1996Merged$PROPDMG * Since1996Merged$PropFactor
Since1996Merged$CropExpense = Since1996Merged$CROPDMG * Since1996Merged$CropFactor
Since1996Merged$TotalExpense =
Since1996Merged$PropertyExpense + Since1996Merged$CropExpense
EventCosts = aggregate(
x=Since1996Merged[
,c("PropertyExpense","CropExpense","TotalExpense")],
by=list(Since1996Merged$Event),FUN="sum")
# pick the top 10 most expensive event catagories
SortedEventCosts=EventCosts[order(-EventCosts$TotalExpense),][1:10,]
The top 10 worst weather events from 1996 to 2011 are shown below, both in terms of fatalities and injuries:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(gridExtra)
p1 = ggplot(data=SortedFatal,
aes(x=reorder(Group.1, -FATALITIES),
y=FATALITIES,fill=FATALITIES)) +
geom_bar(stat="identity", width = 0.5) +
xlab("Event") +
ylab("Total Number of Fatalities") +
theme(legend.position="none",
axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Top 10 Weather Event Fatalities 1996-2011")
p2 = ggplot(data=SortedInjuries, height = 5,
aes(x=reorder(Group.1, -INJURIES),
y=INJURIES,fill=INJURIES)) +
geom_bar(stat="identity", width = 0.5) +
xlab("Event") +
ylab("Total Number of Injuries") +
theme(legend.position="none",
axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Top 10 Weather Event Injuries 1996-2011")
grid.arrange( p1, p2, nrow=2)
Now, we need to analyze the cost of the major events. First, we look at the sorted expenses.
SortedEventCosts
## Group.1 PropertyExpense CropExpense TotalExpense
## 31 FLOOD 144003143200 4983266500 148986409700
## 53 HURRICANE 81118659010 5349282800 86467941810
## 86 STORM 47834724000 855000 47835579000
## 90 TORNADO 24616945710 283425010 24900370720
## 41 HAIL 14595213420 2496822450 17092035870
## 30 FLASH FLOOD 15222268910 1334901700 16557170610
## 20 DROUGHT 1046101000 13367566000 14413667000
## 92 TROPICAL STORM 12173376990 1296322600 13469699590
## 102 WILD/FOREST FIRE 7760449500 402255130 8162704630
## 52 HIGH WIND 5248378360 633561300 5881939660
The costs in this analysis have not been normalized for inflation. That is beyond the scope of the exercise.
In general, the property damage exceeds crop damage by considerable amount. We show both in a stacked bar chart. The following figure shows total storm-related expense by most expensive event:
library (reshape2)
SortedEventCosts=EventCosts[order(-EventCosts$TotalExpense),][1:10,]
Melted=melt(SortedEventCosts, id.var="Group.1")
Melted = Melted[Melted$variable != "TotalExpense",]
names(Melted) = c("Event","Type","Expense")
ggplot(Melted,aes(x=reorder(Event,-Expense),
y=Expense/1000000000,fill=Type)) +
geom_bar(stat="identity") +
ylab("Cost (in $Billions)") +
xlab("Event") +
theme (legend.position="none",
axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Top 10 Storm-Related Costs 1996-2011")
Reviewing the graphs, we find that tornadoes play a major factor in terms of injuries and death, ranking first in injuries and second in deaths. Heat-related situations result in the most deaths, and rank second in injuries. Overall cost of events find that floods are the largest contributor to economic cost of storm-related events. Both floods and flash floods are among the top six events. Drought is the only event that contributes primarily to crop damage.
One point of interest is the relationship between hurricanes and floods. Hurricane Katrina was the most expensive natural disaster in U.S. history, with costs exceeding $126 billion, with at least half of the cost the related flooding following the levy breaks (see http://useconomy.about.com/od/grossdomesticproduct/f/katrina_damage.htm.) It is not clear from this analysis whether those costs are included only in the hurricane event or split over the flood events too.
An interesting follow up to this analysis would be to find out how hurricane and flood events were reported, to see if hurricanes might be a root cause of flood reporting.
NOAA Website - http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Event Cross Reference - http://rpubs.com/jlutzwpi/56060
Data Location - “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”
Coursera Project Information - https://www.coursera.org/learn/reproducible-research/peer/OMZ37/course-project-2
Katrina Costs: http://useconomy.about.com/od/grossdomesticproduct/f/katrina_damage.htm