The purpose of this report is to identify the types of storm events that are most harmful to population health within the United States, and the types that cause the greatest economic cost, using the National Oceanic and Atmospheric Administration (NOAA) storm dataset.
[National Weather Service Instruction 10-165](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) is available for reference.
The dataset contains roughly 60 years of storm events recorded between 1950 and 2011. Free-text input of data, directive changes, and the merging of data from individual counties and states have created many inconsistencies, especially in the categorization of storm event types (EVTYPE). Storm events recorded in the last twenty years account for approximately 80% of the data and are considered a more reliable source for analysis.
Storm event data was analyzed for the time period between November 1991 and November 2011. It was found that excessive heat and tornadoes are the most dangerous to health, causing thousands of fatalities and injuries. Floods cause the greatest economic damage to crops, while ice storms have the greatest economic impact on property.
Load Libraries
library(ggplot2)
The NOAA storm dataset may be downloaded using the following command.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
StormData.csv.bz2 must be in the working directory before unzipping and reading the dataset.
This is a large dataset and may take some time to load. To reduce the volume of data, the first row will be read to determine the appropriate columns of interest.
colData <- read.csv(bzfile("StormData.csv.bz2"), nrows=1, header=TRUE)
names(colData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The columns of interest are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Examine the class names and use them to select and discard columns from the dataset file.
classNames = data.frame(lapply(colData[names(colData)], class))
classNames
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 numeric factor integer factor numeric factor factor factor
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 numeric logical logical logical logical numeric logical
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 numeric logical logical numeric numeric integer numeric numeric
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 1 numeric numeric factor numeric logical logical logical
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 logical numeric numeric numeric numeric logical numeric
Discard unnecessary columns by setting the column classes to NULL thereby reducing the volume of data.
stormData <- read.csv(bzfile("StormData.csv.bz2"), header=TRUE, colClasses = c(rep("NULL", 1),rep("factor", 1), rep("NULL", 3),rep("NULL",2), rep("factor", 1), rep("NULL", 14), rep("numeric",3), rep("factor",1), rep("numeric",1), rep("character",1), rep("NULL",9) ) )
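The positional `rep()` calls above depend on the exact column order of the file. As a hedged alternative, the same `colClasses` vector could be built by column name. The helper `build_col_classes`, the `wanted` vector, and the shortened `demo_names` header are illustrative assumptions, not part of the original analysis.

```r
# Sketch: build a colClasses vector by column name instead of by position.
# `build_col_classes` and `wanted` are hypothetical names used for illustration.
build_col_classes <- function(all_names, wanted) {
  classes <- rep("NULL", length(all_names))   # "NULL" tells read.csv to skip a column
  names(classes) <- all_names
  classes[names(wanted)] <- wanted            # keep only the columns of interest
  classes
}

wanted <- c(BGN_DATE = "factor", EVTYPE = "factor",
            FATALITIES = "numeric", INJURIES = "numeric",
            PROPDMG = "numeric", PROPDMGEXP = "factor",
            CROPDMG = "numeric", CROPDMGEXP = "character")

# A shortened header for demonstration; in practice names(colData) would be used.
demo_names <- c("STATE__", "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
                "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REMARKS")
col_classes <- build_col_classes(demo_names, wanted)
print(col_classes)
```

With the full header, the result could be passed to `read.csv` as `colClasses = col_classes`, which accepts a character vector named by column.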
View the first 6 rows of data
head(stormData)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Rename the columns to be more meaningful and in a standard R format
names(stormData) <- c("begin.date", "event.type", "fatalities", "injuries", "property.damage", "property.damage.unit", "crop.damage", "crop.damage.unit")
Convert the date to a character string in yyyy-mm-dd format
stormData$begin.date <- as.character(as.POSIXlt(strptime(stormData$begin.date, format="%m/%d/%Y")))
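To illustrate the conversion, here is a minimal sketch applied to two of the raw `BGN_DATE` values shown in the `head()` output. It uses `as.Date()` rather than the `strptime()`/`as.POSIXlt()` chain above, which is a substitution made for brevity; the variable names are illustrative.

```r
# Sketch: parse raw BGN_DATE strings and keep only the calendar date.
raw_dates <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# With an explicit format, the trailing " 0:00:00" is ignored during parsing.
iso_dates <- format(as.Date(raw_dates, format = "%m/%d/%Y"), "%Y-%m-%d")
print(iso_dates)
```

Because the result is a zero-padded yyyy-mm-dd string, later string comparisons against dates such as "1991-11-30" follow chronological order.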
Subset the data for the time-period between November 30, 1991 and November 30, 2011.
twentyYearsData <- with(stormData, subset(stormData, begin.date >= "1991-11-30" & begin.date <= "2011-11-30", row.names=NULL))
Calculate the percentage of observations for the twenty year period.
twentyYearPercentageOfObservations <- (nrow(twentyYearsData)/nrow(stormData)) * 100
| Number of Years | Percentage of Observations |
|---|---|
| 20 | 80.7212038 |
The 20-year dataset contains approximately 80% of the observations and should be more consistent and more reliable for analysis than the complete dataset.
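The date window and the percentage calculation can be checked on a small scale. The sketch below uses made-up dates (not taken from the dataset) to show that comparing yyyy-mm-dd strings selects the intended window, with both end dates included.

```r
# Sketch on made-up dates: zero-padded "yyyy-mm-dd" strings sort chronologically,
# so plain string comparison selects the intended window.
demo <- data.frame(begin.date = c("1950-04-18", "1991-11-29", "1991-11-30",
                                  "2005-08-29", "2011-11-30"),
                   stringsAsFactors = FALSE)

in_window <- demo$begin.date >= "1991-11-30" & demo$begin.date <= "2011-11-30"
window <- demo[in_window, , drop = FALSE]

# Share of rows that fall inside the window, as a percentage.
pct <- nrow(window) / nrow(demo) * 100
print(pct)
```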
The NOAA guide categorizes 48 storm event types (EVTYPE), whereas the dataset contains 985 event types. A brief manual examination of the recorded event types reveals that many cannot easily be assigned to a single valid type. To avoid subjective interpretation, EVTYPE will only be cleaned for case sensitivity, pluralization, and obvious mis-categorization that affects the results of the analysis.
Converting the event types to upper case reduces the number of event types from 220 to 205 when aggregated to healthDataByEvent
twentyYearsData$event.type <- toupper(twentyYearsData$event.type)
Remove a trailing S from each event type - this further reduces the number of categories from 205 to 192
twentyYearsData$event.type <- gsub("S$", "",twentyYearsData$event.type)
Replace instances of tstm wind with NOAA defined storm event type thunderstorm wind
twentyYearsData$event.type <- gsub("TSTM WIND", "THUNDERSTORM WIND", twentyYearsData$event.type)
Replace all variations of hurricane with NOAA defined storm event type hurricane
twentyYearsData$event.type <- gsub("HURRICANE.*", "HURRICANE", twentyYearsData$event.type)
River Flood will be converted to Flood
twentyYearsData$event.type <- gsub("RIVER.*", "FLOOD", twentyYearsData$event.type)
Storm Surge will be converted to Storm Surge/Tide
twentyYearsData$event.type <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", twentyYearsData$event.type)
Wild/Forest Fire and Wild Fire will be converted to Wildfire
twentyYearsData$event.type <- gsub("WILD/FOREST.*", "WILDFIRE", twentyYearsData$event.type)
twentyYearsData$event.type <- gsub("WILD FIRE", "WILDFIRE", twentyYearsData$event.type)
Variations of Tropical Storm will be converted to Tropical Storm
twentyYearsData$event.type <- gsub("TROPICAL STORM.*", "TROPICAL STORM", twentyYearsData$event.type)
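The clean-up steps above can be sketched as a single function and tried on a few labels. The wrapper name `clean_event_type` and the sample labels are made up for illustration; the substitutions themselves are the ones used above, in the same order.

```r
# Sketch: the same clean-up steps, wrapped in a hypothetical helper.
clean_event_type <- function(x) {
  x <- toupper(x)                                     # normalise case
  x <- gsub("S$", "", x)                              # drop a trailing S
  x <- gsub("TSTM WIND", "THUNDERSTORM WIND", x)
  x <- gsub("HURRICANE.*", "HURRICANE", x)
  x <- gsub("RIVER.*", "FLOOD", x)
  x <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", x)
  x <- gsub("WILD/FOREST.*", "WILDFIRE", x)
  x <- gsub("WILD FIRE", "WILDFIRE", x)
  x <- gsub("TROPICAL STORM.*", "TROPICAL STORM", x)
  x
}

# Made-up sample labels, chosen to exercise each rule once.
demo_types <- c("Tstm Wind", "Hurricane Opal", "River Flood", "Storm Surge",
                "Wild/Forest Fire", "Wild Fires", "Tropical Storm Gordon", "Floods")
cleaned <- clean_event_type(demo_types)
print(cleaned)
```

Note that the trailing-S rule applies to every label that ends in S, whether or not the S marks a plural, so the order of the rules matters.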
Some events have been conglomerated into a single category: Tornadoes, Thunderstorm Wind, Hail. This entry caused notable damage, resulting in 25 fatalities, 1.6 billion dollars in property damage, and 2.5 million dollars in crop damage. It will be removed from the data as it is not categorized as a single event.
grep("TORNADOES,.*", twentyYearsData$event.type, perl=TRUE)
## [1] 21049
Remove case 21049 event.type Tornadoes, TSTM Wind, Hail
twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="TORNADOES, TSTM WIND, HAIL"),]
twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="TORNADOES, THUNDERSTORM WIND, HAIL"),]
Heavy Rain/Severe Weather will be dropped as it is not clear what this equates to - however an entry caused 2.5 billion dollars of property damage and may warrant further investigation.
grep("HEAVY RAIN/SEVERE.*", twentyYearsData$event.type, perl=TRUE)
## [1] 33222 38104
Remove cases HEAVY RAIN/SEVERE WEATHER
twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="HEAVY RAIN/SEVERE WEATHER"),]
SEVERE THUNDERSTORM could mean wind, hail, or lightning. As it is not clear which, these rows are removed
twentyYearsData <- twentyYearsData[!(twentyYearsData$event.type=="SEVERE THUNDERSTORM"),]
Plots of the top five storm events will be created. To plot more of the valid event types defined by NOAA, the data would need further cleaning.
Convert to millions of dollars
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "H" | crop.damage.unit == "H")] <- 100/1000000
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "K" | crop.damage.unit == "K")] <- 1000/1000000
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "M" | crop.damage.unit == "M")] <- 1
twentyYearsData$multiplier[with(twentyYearsData, property.damage.unit == "B" | crop.damage.unit == "B")] <- 1000
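The assignments above store a single multiplier per row, chosen from either the property unit or the crop unit. For a row whose two units differ (for example, property in K and crop in M), the later assignment appears to win and is then applied to both amounts. As a hedged alternative, the sketch below keeps one multiplier per damage column. The helper `unit_to_millions`, the new column names, and the sample rows are illustrative assumptions, not part of the original analysis.

```r
# Sketch on made-up rows: one multiplier per damage column, in millions of dollars.
# `unit_to_millions` is a hypothetical helper used for illustration only.
unit_to_millions <- function(unit) {
  lookup <- c(H = 100 / 1e6, K = 1000 / 1e6, M = 1, B = 1000)
  unname(lookup[as.character(unit)])   # units not in the lookup become NA
}

demo <- data.frame(property.damage      = c(25, 2.5),
                   property.damage.unit = c("K", "M"),
                   crop.damage          = c(1, 3),
                   crop.damage.unit     = c("M", "K"),
                   stringsAsFactors = FALSE)

# Each amount is scaled by the unit recorded for that same column.
demo$property.millions <- demo$property.damage * unit_to_millions(demo$property.damage.unit)
demo$crop.millions     <- demo$crop.damage * unit_to_millions(demo$crop.damage.unit)
print(demo[, c("property.millions", "crop.millions")])
```

Units outside H, K, M, and B map to NA here, so any sum over such rows needs an explicit decision about how missing multipliers are handled.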
Subset the 20 year data for all fatalities OR injuries that have more than zero occurrences. This is valid as we are summing the number of occurrences and not taking other statistical measures.
healthData <- with(twentyYearsData, subset(twentyYearsData, fatalities > 0 | injuries > 0, select=c(begin.date,event.type, fatalities,injuries)))
str(healthData)
## 'data.frame': 15152 obs. of 4 variables:
## $ begin.date: chr "1992-03-09" "1992-03-09" "1992-03-10" "1992-03-10" ...
## $ event.type: chr "TORNADO" "THUNDERSTORM WIND" "TORNADO" "THUNDERSTORM WIND" ...
## $ fatalities: num 0 1 2 0 0 1 0 0 0 0 ...
## $ injuries : num 1 1 7 7 1 1 2 1 7 5 ...
healthDataByEvent <- healthData
healthDataByEvent$begin.date <- NULL
healthDataByEvent <- aggregate(cbind(injuries, fatalities) ~ event.type,healthDataByEvent, FUN=sum)
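The aggregation step can be illustrated on a toy data frame. The rows below are made up; the `aggregate()` call has the same form as the one above and sums both columns within each event type.

```r
# Sketch on made-up rows: sum injuries and fatalities per event type.
demo <- data.frame(event.type = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
                   injuries   = c(1, 7, 0, 2),
                   fatalities = c(0, 2, 1, 0),
                   stringsAsFactors = FALSE)

# cbind() on the left-hand side aggregates both measures in one call.
totals <- aggregate(cbind(injuries, fatalities) ~ event.type, demo, FUN = sum)
print(totals)
```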
Injuries
Get the top 5 results
injuries <- head(healthDataByEvent[order(-healthDataByEvent$injuries),], 5)
Drop the fatalities column
injuries$fatalities <- NULL
injuries$healthType <- factor("injuries")
colnames(injuries) <- c("event.type","count", "health.type")
Fatalities
fatalities <- head(healthDataByEvent[order(-healthDataByEvent$fatalities),], 5)
Drop the injuries column
fatalities$injuries <- NULL
fatalities$healthType <- factor("fatalities")
colnames(fatalities) <- c("event.type","count", "health.type")
Combine injuries and fatalities into a new data frame
topHealthEvents <- rbind(injuries, fatalities)
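The injuries and fatalities steps follow the same pattern: rank by one measure, keep the top rows, rename the columns, and label the measure. A minimal sketch of that pattern as one function is shown below. The helper `top_events` and the sample totals are hypothetical and only illustrate the approach.

```r
# Sketch: rank by one measure, keep the top n rows, and label the measure.
# `top_events` is a hypothetical helper, not part of the original analysis.
top_events <- function(totals, measure, n = 5) {
  ranked <- totals[order(-totals[[measure]]), c("event.type", measure)]
  out <- head(ranked, n)
  names(out) <- c("event.type", "count")
  out$health.type <- factor(measure)
  out
}

# Made-up totals per event type.
demo_totals <- data.frame(event.type = c("FLOOD", "HEAT", "TORNADO"),
                          injuries   = c(0, 2, 8),
                          fatalities = c(1, 0, 2),
                          stringsAsFactors = FALSE)

combined <- rbind(top_events(demo_totals, "injuries", n = 2),
                  top_events(demo_totals, "fatalities", n = 2))
print(combined)
```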
Property
propertyData <- with(twentyYearsData, subset(twentyYearsData, property.damage > 0, select=c(begin.date,event.type,property.damage, property.damage.unit, multiplier)))
propertyData$actual.amount[propertyData$property.damage>0] <- propertyData$property.damage * propertyData$multiplier
propertyTotalsbyEvent <- aggregate(propertyData$actual.amount, by=list(propertyData$event.type), FUN=sum)
names(propertyTotalsbyEvent) <- c("event.type", "actual.amount")
propertyTotalsbyEvent <- head(propertyTotalsbyEvent[order(-propertyTotalsbyEvent$actual.amount),],5)
Crops
cropData <- with(twentyYearsData, subset(twentyYearsData, crop.damage > 0, select=c(begin.date,event.type,crop.damage, crop.damage.unit, multiplier)))
cropData$actual.amount[cropData$crop.damage>0] <- cropData$crop.damage * cropData$multiplier
cropTotalsbyEvent <- aggregate(cropData$actual.amount, by=list(cropData$event.type), FUN=sum)
names(cropTotalsbyEvent) <- c("event.type", "actual.amount")
cropTotalsbyEvent <- head(cropTotalsbyEvent[order(-cropTotalsbyEvent$actual.amount),],5)
Plots have been created for the top five storm events that cause the most injuries, fatalities, property damage, and crop damage.
Health - Number of Injuries and Fatalities
ggplot(topHealthEvents, aes(event.type, count, fill=event.type)) + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), legend.title=element_blank()) + facet_wrap(~health.type, ncol=2, scales="free" ) + xlab("Event Type") + ylab("Count") + ggtitle("Health Events (Injuries and Fatalities) Due To Storm Events")
Figure 1 - Storm events that caused the most injuries and fatalities between November 1991 - November 2011
Tornadoes caused the most injuries (nearly 25,000) and more than 1,500 fatalities in the United States during this twenty-year period. Excessive Heat caused the most fatalities (over 1,750) and is associated with more than 5,000 injuries. Flash Flood, Flood, Heat, and Lightning are also among the events that cause the most injuries and fatalities.
NOAA states that property and crop damage figures should be considered broad estimates. Further analysis is encouraged to determine the cause of outliers in these results.
Property Damages
ggplot(propertyTotalsbyEvent, aes(factor(event.type), log10(actual.amount), fill=event.type)) + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),legend.title=element_blank()) + xlab("Event Type") + ylab("Damages (Scaled by log10)") + ggtitle("Property Damages Due To Storm Events")
Figure 2 - Storm events that caused the most property damage between November 1991 - November 2011
The dollar amount (y axis) in Figure 2 has been scaled by log10 because the total for the “Ice Storm” event is many times larger than the totals for the other events, which could not otherwise be viewed properly in the graph.
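The effect of the log10 scaling can be seen from the numbers alone. The sketch below uses the property totals (in millions of dollars) copied from the data frame printed in this report; the variable names are illustrative.

```r
# Totals in millions of dollars, copied from the printed data frame in this report.
totals <- c("ICE STORM" = 504069.353, "STORM SURGE/TIDE" = 47964.724,
            "WILDFIRE" = 10477.576, "TROPICAL STORM" = 9907.595,
            "DROUGHT" = 2196.954)

# On a linear axis each bar is this fraction of the tallest bar.
linear_share <- round(totals / max(totals), 3)
print(linear_share)

# On the log10 axis the bar heights differ by orders of magnitude instead.
log_height <- round(log10(totals), 2)
print(log_height)
```

On the log10 axis a difference of 1 between two bars corresponds to a tenfold difference in dollars, so bar heights should not be compared as if they were linear.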
Ice Storm is the event that has caused the most property damage. Because its dollar value is extremely high compared to the other events, further analysis may be necessary to confirm the validity of this data. Storm Surge/Tide, Wildfire, Tropical Storm, and Drought are also among the events that cause the most property damage.
The data frame is printed to provide the dollar amounts.
NB. Amounts are in millions of dollars and must be multiplied by 1,000,000 to obtain the actual dollar amount.
propertyTotalsbyEvent
## event.type actual.amount
## 161 ICE STORM 504069.353
## 233 STORM SURGE/TIDE 47964.724
## 317 WILDFIRE 10477.576
## 296 TROPICAL STORM 9907.595
## 32 DROUGHT 2196.954
Crop Damages
ggplot(cropTotalsbyEvent, aes(factor(event.type), actual.amount, fill=event.type)) + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),legend.title=element_blank()) + xlab("Event Type") + ylab("Damages (Millions of Dollars)") + ggtitle("Crop Damages Due To Storm Events")
Figure 3 - Storm events that caused the most crop damage between November 1991 - November 2011
Flood events cause the most damage to crops, at nearly a trillion dollars. Flash Flood, Tornado, Thunderstorm Wind, and Drought are also among the events that cause the most crop damage.