Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Load the data.
# Directory holding the downloaded NOAA file
datadir <- "~/Coursera/Reproducible Research/Project/Project2"
# read.csv() decompresses the .bz2 file itself, so no explicit connection is needed
noaaData <- read.csv(file.path(datadir, "repdata-data-StormData.csv.bz2"))
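As a quick check that compressed input loads as expected, the following sketch writes a tiny bzip2-compressed CSV to a temporary file and reads it back. The file and its two rows are made up for illustration; it relies only on base R (`bzfile()`, `read.csv()`).

```r
# Write a small, made-up CSV through a bzip2 connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv() detects the compression from the file contents
toy <- read.csv(tmp)
nrow(toy)  # number of observations read back
```

Calling `nrow()` or `dim()` right after loading the real file is a cheap way to confirm the observation count quoted in the text.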
There are 902297 observations in this data.
Load the required libraries.
library(gdata)
## gdata: Unable to locate valid perl interpreter
## gdata:
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata:
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
##
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
##
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
##
## Attaching package: 'gdata'
##
## The following object is masked from 'package:stats':
##
## nobs
##
## The following object is masked from 'package:utils':
##
## object.size
library(plyr)
Clean the Data
The EVTYPE, PROPDMGEXP and CROPDMGEXP variables are inconsistent in their use of case, so I am going to convert them all to uppercase.
Many EVTYPE values also refer to the same event but differ in spelling or carry extra numbers or characters, so I am going to correct those too.
# Convert EVTYPE to uppercase: goes from 985 to 898 distinct EVTYPE values
noaaData$EVTYPE <- toupper(noaaData$EVTYPE)
# Convert the damage codes to uppercase too, for use when calculating the economic cost
noaaData$PROPDMGEXP <- toupper(noaaData$PROPDMGEXP)
noaaData$CROPDMGEXP <- toupper(noaaData$CROPDMGEXP)
# Trim leading and trailing whitespace (trim() is from gdata):
# goes from 898 to 890 distinct EVTYPE values
noaaData$EVTYPE <- trim(noaaData$EVTYPE)
# Try to correct obvious spelling/typing/classification mistakes:
# goes from 890 to 408 distinct EVTYPE values
evtypelist <- as.character(noaaData$EVTYPE)
evtypelist <- gsub(".*THUNDERSTORM.*", "THUNDERSTORM", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*TSTM.*", "THUNDERSTORM", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*TORNADO.*", "TORNADO", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*HURRICANE.*", "HURRICANE", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*HEAT.*", "HEAT", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*WARM.*", "WARM", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SNOW.*", "SNOW", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*COLD.*", "COLD", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*COOL.*", "COLD", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*RAIN.*", "RAIN", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SURF.*", "SURF", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*FLOOD.*", "FLOOD", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*RISING WATER.*", "FLOOD", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*AVALAN.*", "AVALANCHE", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*MUDSLIDE.*", "MUDSLIDE", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*LANDSLIDE.*", "LANDSLIDE", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*FIRE.*", "FIRE", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SEAS.*", "SEA", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SWELLS.*", "SEA", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*WAVE.*", "SEA", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*BEACH.*", "BEACH", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*BITTER.*", "COLD", evtypelist, ignore.case=TRUE)
noaaData$EVTYPE <- evtypelist
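The chain of `gsub()` calls can also be written as a table of rules applied in order, which makes the order dependence easier to see. This is a minimal sketch, not the code used for the analysis: `normalise_evtype` and `rules` are hypothetical names, only three of the patterns are shown, and the input strings are illustrative.

```r
# Hypothetical helper: apply pattern -> label rules one after another
normalise_evtype <- function(x, rules) {
  for (i in seq_len(nrow(rules))) {
    x <- gsub(rules$pattern[i], rules$label[i], x)
  }
  x
}

# A small subset of the rules, in the same relative order as above
rules <- data.frame(
  pattern = c(".*TSTM.*", ".*WARM.*", ".*SEAS.*"),
  label   = c("THUNDERSTORM", "WARM", "SEA"),
  stringsAsFactors = FALSE
)

raw <- c("TSTM WIND", "UNSEASONABLY WARM", "HEAVY SEAS", "UNSEASONABLY DRY")
normalise_evtype(raw, rules)
# "THUNDERSTORM" "WARM" "SEA" "SEA"
```

Note that "UNSEASONABLY DRY" ends up as "SEA" because it contains the substring "SEAS". Broad patterns like this are worth reviewing with `table()` after each pass.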
There are still 902297 observations in the data.
Let's add a new field containing the total economic cost of each event. This can be calculated by adding together the crop damage and the property damage.
# Convert a damage amount and its exponent code into a dollar value.
# Vectorised, so each row is scaled by its own exponent code.
# Assumes H = hundreds, K = thousands, M = millions, B = billions.
damage <- function(a, exp) {
  multiplier <- c(H = 100, K = 1000, M = 1000000, B = 1000000000)[as.character(exp)]
  # Blank or unrecognised codes leave the amount unscaled
  multiplier[is.na(multiplier)] <- 1
  a * unname(multiplier)
}
# Economic damage is caused by property and crop damage added together
noaaData$total_cost<-(damage(noaaData$PROPDMG, noaaData$PROPDMGEXP) + damage(noaaData$CROPDMG, noaaData$CROPDMGEXP))
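The exponent arithmetic can be checked by hand on a few made-up rows. This sketch assumes the usual reading of the codes (H = hundreds, K = thousands, M = millions, B = billions) and uses a hypothetical lookup vector named `exp_multiplier`; the values are invented for illustration.

```r
# Assumed meaning of the exponent codes
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1),
  PROPDMGEXP = c("M", "K", "B"),
  stringsAsFactors = FALSE
)

# Look up one multiplier per row, then scale each amount by its own code
toy$prop_cost <- toy$PROPDMG * unname(exp_multiplier[toy$PROPDMGEXP])
toy$prop_cost
# 2.5e+06 1.0e+04 1.0e+09
```

The important point is that the multiplier is chosen row by row: 2.5 with code "M" is 2,500,000, while 10 with code "K" is 10,000.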
Limit the data to events with at least one injury or fatality.
healthimpactData <- noaaData[noaaData$FATALITIES > 0 | noaaData$INJURIES > 0,]
healthimpactData$totImpact <- healthimpactData$FATALITIES + healthimpactData$INJURIES
Now there are 21929 observations.
Let's summarise that data by event type and sort it by total impact.
healthimpind<-ddply(healthimpactData, c("EVTYPE"), summarise, N=length(totImpact),
total_impact=sum(totImpact),
average_impact=mean(totImpact),
median_impact=median(totImpact)
)
healthimpind<-healthimpind[order(-healthimpind$total_impact),]
The summary now has 88 rows, one per event type. Subset the health impact index to the top 10.
t10_health_impact<-healthimpind[1:10,]
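The same summarise, sort and subset pattern can be sketched in base R without plyr. The data frame below is made up, and `top2` stands in for the top-10 table; this is an illustration of the pattern rather than a replacement for the `ddply()` call.

```r
# Made-up events with a combined fatalities + injuries count
toy <- data.frame(
  EVTYPE    = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  totImpact = c(10, 4, 6, 1, 3),
  stringsAsFactors = FALSE
)

# Total impact per event type
totals <- aggregate(totImpact ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order and keep the top 2 rows
top2 <- head(totals[order(-totals$totImpact), ], 2)
top2
```

Using `head(x, n)` instead of `x[1:n, ]` avoids padding the result with `NA` rows when fewer than `n` event types are present.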
Going back to the economic cost, let's limit the data to events with a positive economic cost.
costimpactData <- noaaData[noaaData$total_cost >0 ,]
Now there are 245031 observations.
Let's summarise that data by event type and sort it by total cost.
costimpind<-ddply(costimpactData, c("EVTYPE"), summarise, N=length(total_cost),
total_impact=sum(total_cost),
average_impact=mean(total_cost),
median_impact=median(total_cost)
)
costimpind<-arrange(costimpind,desc(total_impact))
The summary now has 160 rows, one per event type. Subset the cost impact index to the top 10.
t10_cost_impact<-costimpind[1:10,]
The first question asked which types of events (as indicated in the EVTYPE variable) are most harmful to population health across the United States.
The top events by combined fatalities and injuries are:
par(cex.main=1.25, cex=0.75)
barplot(t10_health_impact$total_impact,
main="Total population health harm per event type",
ylab= "Persons harmed",
xlab="Event type",
col = 1:10,
legend.text = t10_health_impact$EVTYPE)
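With ten bars and R's default eight-colour palette, colour indices 9 and 10 repeat colours 1 and 2, so a colour legend can be ambiguous. One alternative is to label each bar directly with `names.arg`. The sketch below uses made-up totals in place of the top-10 table and draws to a null PDF device so that it runs without a display.

```r
# Made-up totals standing in for the top-10 table
toy <- data.frame(
  EVTYPE       = c("TORNADO", "HEAT", "FLOOD"),
  total_impact = c(16, 7, 1),
  stringsAsFactors = FALSE
)

pdf(NULL)  # null device: nothing is written to disk
mids <- barplot(toy$total_impact,
                names.arg = toy$EVTYPE,  # label each bar with its event type
                las = 2,                 # rotate labels so long names fit
                main = "Total population health harm per event type",
                ylab = "Persons harmed")
dev.off()
```

`barplot()` returns the bar midpoints (`mids` here), which can be passed to `text()` if value labels are wanted above the bars.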
As the plot shows, tornadoes cause by far the most fatalities and injuries in the US.
The second question asked which types of events have the greatest economic consequences across the United States.
The costliest events are:
par(cex.main=1.25, cex=0.75)
barplot(t10_cost_impact$total_impact,
main="Total economic consequence per event type",
ylab = "Cost ($)",
xlab = "Event type",
col = 1:10,
legend.text = t10_cost_impact$EVTYPE)
Based on these totals, tornadoes again have the greatest economic consequences, although thunderstorms and floods are also quite high.
I think that more time needs to be spent understanding the causes of these events and whether any mitigation could reduce both fatalities/injuries and economic costs. A cost-benefit analysis of the suggested mitigations could then show where investment should be focused.
For a more accurate analysis, the earlier years, when the dataset was populated inconsistently, could also be removed. This would also account for the fact that building codes, safety procedures and farming practices are different in more modern times.
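Removing the earlier years could be sketched as follows. Everything here is an assumption made for illustration: the `BGN_DATE` column name and its month/day/year format are not shown in this report, the rows are made up, and `cutoff_year` is a hypothetical threshold rather than a recommended one.

```r
# Made-up rows; the BGN_DATE name and format are assumptions
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/1/1996 0:00:00", "11/28/2011 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  stringsAsFactors = FALSE
)
cutoff_year <- 1990  # hypothetical cutoff, chosen only for illustration

# Parse the date part (the trailing time is ignored) and extract the year
toy$year <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Keep only events from the cutoff year onwards
recent <- toy[toy$year >= cutoff_year, ]
nrow(recent)
```

Plotting the number of events per year first would give a more defensible basis for choosing the cutoff than picking a round number.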