Synopsis

The purpose of this report is to identify the types of storm events that are most harmful to population health in the United States, and the types that cause the greatest economic cost, using the National Oceanic and Atmospheric Administration (NOAA) storm dataset.
[National Weather Service Instruction 10-165](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) is available for reference.

The dataset contains roughly 60 years of storm events recorded between 1950 and 2011. Free-text data entry, changes in directives, and the merging of data from individual counties and states have created many inconsistencies, especially in the categorization of storm event types (EVTYPE). Storm events recorded in the last twenty years account for approximately 80% of the data and are considered a more reliable basis for analysis.

Storm event data was analyzed for the period between November 1991 and November 2011. Excessive heat and tornadoes were found to be the most dangerous to health, causing thousands of fatalities and injuries. Floods caused the greatest economic damage to crops, while ice storms caused the greatest economic damage to property.

Data Processing

Load Libraries

library(ggplot2)

The NOAA storm dataset may be downloaded using the following command.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2", mode="wb")

StormData.csv.bz2 must be in the working directory before unzipping and reading the dataset.
This is a large dataset and may take some time to load. To reduce the volume of data, the first row will be read to determine the appropriate columns of interest.

colData <- read.csv(bzfile("StormData.csv.bz2"), nrows=1, header=TRUE)
names(colData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The columns of interest are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

Examine the class names and use those to select and discard columns from the dataset file.

classNames = data.frame(lapply(colData[names(colData)], class))
classNames
##   STATE__ BGN_DATE BGN_TIME TIME_ZONE  COUNTY COUNTYNAME  STATE EVTYPE
## 1 numeric   factor  integer    factor numeric     factor factor factor
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1   numeric logical    logical  logical  logical    numeric    logical
##   END_RANGE END_AZI END_LOCATI  LENGTH   WIDTH       F     MAG FATALITIES
## 1   numeric logical    logical numeric numeric integer numeric    numeric
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP     WFO STATEOFFIC
## 1  numeric numeric     factor numeric    logical logical    logical
##   ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS  REFNUM
## 1   logical  numeric   numeric    numeric    numeric logical numeric

Discard unnecessary columns by setting their column classes to "NULL", thereby reducing the volume of data.

stormData <- read.csv(bzfile("StormData.csv.bz2"), header=TRUE,
                      colClasses = c("NULL",             # STATE__
                                     "factor",           # BGN_DATE
                                     rep("NULL", 5),     # BGN_TIME to STATE
                                     "factor",           # EVTYPE
                                     rep("NULL", 14),    # BGN_RANGE to MAG
                                     rep("numeric", 3),  # FATALITIES, INJURIES, PROPDMG
                                     "factor",           # PROPDMGEXP
                                     "numeric",          # CROPDMG
                                     "character",        # CROPDMGEXP
                                     rep("NULL", 9)))    # WFO to REFNUM
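Counting positions in colClasses is easy to get wrong, so the same selection could also be expressed by column name. The following is a minimal sketch rather than the method used in this report: buildColClasses is a hypothetical helper, and the five-column header is a small stand-in for the real 37-column file.

```r
# Hypothetical helper: build a colClasses vector from column names.
# Columns not listed in `keep` are set to "NULL" so read.csv skips them.
buildColClasses <- function(allNames, keep) {
  classes <- rep("NULL", length(allNames))
  names(classes) <- allNames
  classes[names(keep)] <- keep
  unname(classes)
}

# Stand-in header (assumption: a shortened version of the real column list)
allNames <- c("STATE__", "BGN_DATE", "EVTYPE", "FATALITIES", "REMARKS")
keep <- c(BGN_DATE = "character", EVTYPE = "character", FATALITIES = "numeric")
colClasses <- buildColClasses(allNames, keep)

# Two rows of inline CSV text used in place of StormData.csv.bz2
csvText <- "STATE__,BGN_DATE,EVTYPE,FATALITIES,REMARKS
1,4/18/1950 0:00:00,TORNADO,0,none
1,2/20/1951 0:00:00,TORNADO,2,none"
demo <- read.csv(text = csvText, header = TRUE, colClasses = colClasses)
names(demo)
```

With the real file, allNames could be taken from names(colData), so the selection would stay correct even if the column order changed.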

View the first 6 rows of data

head(stormData)
##             BGN_DATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1  4/18/1950 0:00:00 TORNADO          0       15    25.0          K
## 2  4/18/1950 0:00:00 TORNADO          0        0     2.5          K
## 3  2/20/1951 0:00:00 TORNADO          0        2    25.0          K
## 4   6/8/1951 0:00:00 TORNADO          0        2     2.5          K
## 5 11/15/1951 0:00:00 TORNADO          0        2     2.5          K
## 6 11/15/1951 0:00:00 TORNADO          0        6     2.5          K
##   CROPDMG CROPDMGEXP
## 1       0           
## 2       0           
## 3       0           
## 4       0           
## 5       0           
## 6       0

Rename the columns to be more meaningful and in a standard R format

names(stormData) <- c("begin.date", "event.type", "fatalities", "injuries", "property.damage", "property.damage.unit", "crop.damage", "crop.damage.unit")

Convert the date to an ISO 8601 string (yyyy-mm-dd). Strings in this form sort chronologically, so they can be compared directly when subsetting.

stormData$begin.date <- format(as.Date(as.character(stormData$begin.date), format="%m/%d/%Y"))
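As a small illustration of why dates stored as yyyy-mm-dd text can be filtered with plain string comparison, the sketch below converts three sample values in the dataset's m/d/Y format and tests them against the twenty-year window. The variable names are illustrative only.

```r
# Sample raw values in the same format as BGN_DATE
raw <- c("4/18/1950 0:00:00", "11/30/1991 0:00:00", "6/8/2011 0:00:00")

# Parse the date part (the trailing time is ignored) and format as yyyy-mm-dd
iso <- format(as.Date(raw, format = "%m/%d/%Y"))

# Fixed-width yyyy-mm-dd strings compare in chronological order
inRange <- iso >= "1991-11-30" & iso <= "2011-11-30"
data.frame(raw, iso, inRange)
```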

Subset the data for the time-period between November 30, 1991 and November 30, 2011.

twentyYearsData <- subset(stormData, begin.date >= "1991-11-30" & begin.date <= "2011-11-30")

Calculate the percentage of observations for the twenty year period.

twentyYearPercentageOfObservations <- (nrow(twentyYearsData)/nrow(stormData)) * 100
Number of Years    Percentage of Observations
20                 80.7212038

The 20-year dataset contains about 80.7% of the observations. It should be more consistent, and therefore a more reliable basis for analysis, than the complete dataset.

The NOAA guide defines 48 storm event types (EVTYPE), whereas the dataset contains 985. A brief manual examination of the recorded event types reveals that many cannot easily be assigned to a single valid type. To avoid subjective interpretation, EVTYPE will only be cleaned for case sensitivity, pluralization, and obvious mis-categorization that affects the results of the analysis.

Convert the event types to upper case. This reduces the number of event types from 220 to 205 when aggregated to healthDataByEvent

twentyYearsData$event.type <- toupper(twentyYearsData$event.type)

Remove a trailing S to merge plural forms. This further reduces the number of categories from 205 to 192

twentyYearsData$event.type <- gsub("S$", "",twentyYearsData$event.type)

Replace instances of TSTM WIND with the NOAA-defined storm event type THUNDERSTORM WIND

twentyYearsData$event.type <- gsub("TSTM WIND", "THUNDERSTORM WIND", twentyYearsData$event.type)

Replace all variations of HURRICANE with the NOAA-defined storm event type HURRICANE

twentyYearsData$event.type <- gsub("HURRICANE.*", "HURRICANE", twentyYearsData$event.type)

River Flood will be converted to Flood

twentyYearsData$event.type <- gsub("RIVER.*", "FLOOD", twentyYearsData$event.type)

Storm Surge will be converted to Storm Surge/Tide

twentyYearsData$event.type <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", twentyYearsData$event.type)

Wild/Forest Fire and Wild Fire will be converted to Wildfire

twentyYearsData$event.type <- gsub("WILD/FOREST.*", "WILDFIRE", twentyYearsData$event.type)
twentyYearsData$event.type <- gsub("WILD FIRE", "WILDFIRE", twentyYearsData$event.type)

Variations of Tropical Storm will be converted to Tropical Storm

twentyYearsData$event.type <- gsub("TROPICAL STORM.*", "TROPICAL STORM", twentyYearsData$event.type)
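To see the combined effect of the substitutions above, they can be applied to a handful of event names. This is only a sketch: cleanEventType is a hypothetical wrapper that chains the same patterns in the same order, and the sample names (including the storm names) are made up for illustration.

```r
# Hypothetical wrapper chaining the same substitutions used in this report
cleanEventType <- function(x) {
  x <- toupper(x)                                     # case
  x <- gsub("S$", "", x)                              # trailing plural S
  x <- gsub("TSTM WIND", "THUNDERSTORM WIND", x)
  x <- gsub("HURRICANE.*", "HURRICANE", x)
  x <- gsub("RIVER.*", "FLOOD", x)
  x <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", x)
  x <- gsub("WILD/FOREST.*", "WILDFIRE", x)
  x <- gsub("WILD FIRE", "WILDFIRE", x)
  x <- gsub("TROPICAL STORM.*", "TROPICAL STORM", x)
  x
}

# Illustrative inputs, not taken from the dataset
rawTypes <- c("Tstm Wind", "THUNDERSTORM WINDS", "Hurricane Opal",
              "River Flood", "Storm Surge", "WILD/FOREST FIRE",
              "Tropical Storm Alberto")
cleaned <- cleanEventType(rawTypes)
data.frame(rawTypes, cleaned)
```

The first two inputs collapse into the same category, which is how the number of distinct event types shrinks at each step.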

Some events have been combined into a single category, Tornadoes, Thunderstorm Wind, Hail. This entry caused notable damage: 25 fatalities, 1.6 billion dollars in property damage, and 2.5 million dollars in crop damage. It will be removed from the data because it is not categorized as a single event.

grep("TORNADOES,.*", twentyYearsData$event.type, perl=TRUE)
## [1] 21049

Remove case 21049 event.type Tornadoes, TSTM Wind, Hail

twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="TORNADOES, TSTM WIND, HAIL"),]
twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="TORNADOES, THUNDERSTORM WIND, HAIL"),]

Heavy Rain/Severe Weather will be dropped as it is not clear what this equates to. However, one entry caused 2.5 billion dollars of property damage and may warrant further investigation.

grep("HEAVY RAIN/SEVERE.*", twentyYearsData$event.type, perl=TRUE)
## [1] 33222 38104

Remove cases HEAVY RAIN/SEVERE WEATHER

twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="HEAVY RAIN/SEVERE WEATHER"),]

SEVERE THUNDERSTORM could mean wind, hail, or lightning. As the intended type is not clear, these rows are removed

twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="SEVERE THUNDERSTORM"),]
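The row removals above each test for one event type at a time. A compact alternative is to list the ambiguous types once and filter with %in%. The sketch below uses a small made-up data frame; the names ambiguous, toy, and kept are illustrative only.

```r
# Event types treated as ambiguous in this report
ambiguous <- c("TORNADOES, THUNDERSTORM WIND, HAIL",
               "HEAVY RAIN/SEVERE WEATHER",
               "SEVERE THUNDERSTORM")

# Made-up rows standing in for twentyYearsData
toy <- data.frame(
  event.type = c("TORNADO", "SEVERE THUNDERSTORM", "HAIL",
                 "HEAVY RAIN/SEVERE WEATHER"),
  fatalities = c(1, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Keep only rows whose event type is not in the ambiguous list
kept <- toy[!(toy$event.type %in% ambiguous), ]
kept
```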

Plots of the top five storm events will be created. Plotting more of the valid event types defined by NOAA would require further cleaning of the data.

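Set a multiplier that converts damage amounts to millions of dollars, based on the unit codes (H = hundreds, K = thousands, M = millions, B = billions)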

twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "H" | crop.damage.unit == "H")] <- 100/1000000
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "K" | crop.damage.unit == "K")] <- 1000/1000000
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "M" | crop.damage.unit == "M")] <- 1
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "B" | crop.damage.unit == "B")] <- 1000
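Because a single multiplier column is shared by property and crop damage, a row whose two unit codes differ receives whichever assignment ran last. One way to avoid that is to derive a separate multiplier per damage column from a lookup table. The following is a sketch under stated assumptions, not the method used for the results in this report: toMultiplier, property.millions, and crop.millions are hypothetical names, and treating lower-case unit codes as equivalent to upper-case ones is an assumption.

```r
# Lookup from unit code to a multiplier expressed in millions of dollars
unitToMillions <- c(H = 100 / 1e6, K = 1e3 / 1e6, M = 1, B = 1e3)

# Hypothetical helper: blank or unrecognized unit codes become NA
# (assumption: lower-case codes mean the same as upper-case ones)
toMultiplier <- function(unit) {
  unname(unitToMillions[toupper(as.character(unit))])
}

# Made-up rows in which property and crop units differ
demo <- data.frame(
  property.damage = c(25, 2.5, 1),
  property.damage.unit = c("K", "M", "B"),
  crop.damage = c(10, 0, 5),
  crop.damage.unit = c("M", "", "K"),
  stringsAsFactors = FALSE
)

demo$property.millions <- demo$property.damage * toMultiplier(demo$property.damage.unit)
demo$crop.millions <- demo$crop.damage * toMultiplier(demo$crop.damage.unit)
demo
```

Rows with a blank unit yield NA rather than a guessed amount, so any aggregation would need to decide explicitly how to treat them (for example with na.rm=TRUE).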

Subset the 20-year data to rows with at least one fatality or injury. This is valid because the analysis only sums occurrences and does not use other statistical measures.

Health Data

healthData <- subset(twentyYearsData, fatalities > 0 | injuries > 0, select=c(begin.date, event.type, fatalities, injuries))
str(healthData)
## 'data.frame':    15152 obs. of  4 variables:
##  $ begin.date: chr  "1992-03-09" "1992-03-09" "1992-03-10" "1992-03-10" ...
##  $ event.type: chr  "TORNADO" "THUNDERSTORM WIND" "TORNADO" "THUNDERSTORM WIND" ...
##  $ fatalities: num  0 1 2 0 0 1 0 0 0 0 ...
##  $ injuries  : num  1 1 7 7 1 1 2 1 7 5 ...
healthDataByEvent <- healthData
healthDataByEvent$begin.date <- NULL
healthDataByEvent <- aggregate(cbind(injuries, fatalities) ~ event.type,healthDataByEvent, FUN=sum)
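The formula form of aggregate used above sums both measures per event type in a single call. A toy example with made-up numbers shows the shape of the result; toy and byEvent are illustrative names.

```r
# Made-up observations: two tornado rows, one flood row, one heat row
toy <- data.frame(
  event.type = c("TORNADO", "TORNADO", "FLOOD", "EXCESSIVE HEAT"),
  injuries = c(15, 2, 0, 4),
  fatalities = c(0, 1, 3, 2),
  stringsAsFactors = FALSE
)

# cbind() on the left-hand side aggregates several columns at once;
# the right-hand side names the grouping column
byEvent <- aggregate(cbind(injuries, fatalities) ~ event.type, toy, FUN = sum)
byEvent
```

The result has one row per event type, with the two tornado rows combined into a single total.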

Injuries

Get the top 5 results

injuries <- head(healthDataByEvent[order(-healthDataByEvent$injuries),], 5)

Drop the fatalities column

injuries$fatalities <- NULL
injuries$healthType <- factor("injuries")
colnames(injuries) <- c("event.type","count", "health.type")

Fatalities

fatalities <- head(healthDataByEvent[order(-healthDataByEvent$fatalities),], 5)

Drop the injuries column

fatalities$injuries <- NULL
fatalities$healthType <- factor("fatalities")
colnames(fatalities) <- c("event.type","count", "health.type")

Combine injuries and fatalities into a new dataframe

topHealthEvents <- rbind(injuries, fatalities)
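The injuries and fatalities steps follow the same pattern: sort by one measure, keep the top rows, and label them. As a sketch, that pattern could be wrapped in one function. topByMeasure is a hypothetical helper, and the toy data frame is made up for illustration.

```r
# Hypothetical helper: top n rows by one measure, in long format
topByMeasure <- function(df, measure, n = 5) {
  top <- head(df[order(-df[[measure]]), ], n)
  data.frame(event.type = top$event.type,
             count = top[[measure]],
             health.type = factor(measure),
             stringsAsFactors = FALSE)
}

# Made-up totals per event type
toy <- data.frame(event.type = c("A", "B", "C"),
                  injuries = c(5, 20, 1),
                  fatalities = c(7, 0, 3),
                  stringsAsFactors = FALSE)

# Top 2 per measure, stacked into one data frame for faceted plotting
combined <- rbind(topByMeasure(toy, "injuries", 2),
                  topByMeasure(toy, "fatalities", 2))
combined
```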

Damage Data

Property

propertyData <- with(twentyYearsData, subset(twentyYearsData, property.damage > 0, select=c(begin.date,event.type,property.damage, property.damage.unit, multiplier)))
propertyData$actual.amount[propertyData$property.damage>0] <- propertyData$property.damage * propertyData$multiplier
propertyTotalsbyEvent <- aggregate(propertyData$actual.amount, by=list(propertyData$event.type), FUN=sum)
names(propertyTotalsbyEvent) <- c("event.type", "actual.amount")
propertyTotalsbyEvent <- head(propertyTotalsbyEvent[order(-propertyTotalsbyEvent$actual.amount),],5)

Crops

cropData <- with(twentyYearsData, subset(twentyYearsData, crop.damage > 0, select=c(begin.date,event.type,crop.damage, crop.damage.unit, multiplier)))
cropData$actual.amount[cropData$crop.damage>0] <- cropData$crop.damage * cropData$multiplier
cropTotalsbyEvent <- aggregate(cropData$actual.amount, by=list(cropData$event.type), FUN=sum)
names(cropTotalsbyEvent) <- c("event.type", "actual.amount")
cropTotalsbyEvent <- head(cropTotalsbyEvent[order(-cropTotalsbyEvent$actual.amount),],5)

Results

Plots have been created for the top five storm events that cause the most injuries, fatalities, property damage, and crop damage.

Health - Number of Injuries and Fatalities

ggplot(topHealthEvents, aes(event.type, count)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), legend.title=element_blank()) +
  facet_wrap(~health.type, ncol=2, scales="free") +
  xlab("Event Type") + ylab("Count") +
  ggtitle("Health Events (Injuries and Fatalities) Due To Storm Events")

Figure 1 - Storm events that caused the most injuries and fatalities between November 1991 and November 2011

Tornadoes caused the most injuries (nearly 25,000) and more than 1,500 fatalities in the United States during this twenty-year period. Excessive Heat caused the most fatalities (over 1,750) and more than 5,000 injuries. Flash Floods, Floods, Heat, and Lightning are also among the events that caused the most injuries and fatalities.

NOAA states that property and crop damage figures should be considered broad estimates. Further analysis is encouraged to determine the cause of outliers in these results.

Property Damages

ggplot(propertyTotalsbyEvent, aes(factor(event.type), log10(actual.amount), fill=event.type)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), legend.title=element_blank()) +
  xlab("Event Type") + ylab("Damages (Scaled by log10)") +
  ggtitle("Property Damages Due To Storm Events")

Figure 2 - Storm events that caused the most property damage between November 1991 and November 2011

The dollar amount (y axis) in Figure 2 has been scaled by log10 because the total for the "Ice Storm" event is more than ten times that of the next-largest event. On a linear scale the other events would be too small to read in the graph.
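The effect of the log10 scaling can be checked with the largest and smallest of the five property totals printed in this report (in millions of dollars). This is only a worked illustration; the variable names are arbitrary.

```r
# Largest and smallest of the top five property totals, in millions of dollars
amounts <- c("ICE STORM" = 504069.353, "DROUGHT" = 2196.954)

# On a linear axis the tallest bar is a few hundred times the shortest
ratio <- amounts[["ICE STORM"]] / amounts[["DROUGHT"]]

# On a log10 axis the same gap is the difference of the logs,
# which is a little over two units
logged <- log10(amounts)
gap <- logged[["ICE STORM"]] - logged[["DROUGHT"]]

c(ratio = ratio, gap = gap)
```

A gap of that size on the log scale keeps every bar visible, at the cost that bar heights no longer compare proportionally.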

Ice Storm is the event that caused the most property damage. As the dollar value is extremely high compared to the other events, further analysis may be necessary to confirm the validity of this data. Storm Surge/Tide, Wildfire, Tropical Storm, and Drought are also among the events that caused the most property damage.

The dataframe is printed to provide the dollar amounts.
NB. Amounts are in millions of dollars and must be multiplied by 1,000,000 to obtain the actual amount.

propertyTotalsbyEvent
##           event.type actual.amount
## 161        ICE STORM    504069.353
## 233 STORM SURGE/TIDE     47964.724
## 317         WILDFIRE     10477.576
## 296   TROPICAL STORM      9907.595
## 32           DROUGHT      2196.954

Crop Damages

ggplot(cropTotalsbyEvent, aes(factor(event.type), actual.amount, fill=event.type)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), legend.title=element_blank()) +
  xlab("Event Type") + ylab("Damages (Millions of Dollars)") +
  ggtitle("Crop Damages Due To Storm Events")

Figure 3 - Storm events that caused the most crop damage between November 1991 and November 2011

Flood events caused the most damage to crops, at nearly a trillion dollars. Flash Flood, Tornado, Thunderstorm Wind, and Drought are also among the events that caused the most crop damage.