Severe weather events can cause both public health and economic problems for communities and municipalities. Many of them can result in fatalities, injuries, and property damage. Preventing such outcomes to the extent possible is a key concern.
The NOAA Storm Database (http://www.ncdc.noaa.gov/stormevents/) tracks characteristics of major storms and weather events in the United States. It records when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The goal of this analysis is to explore the NOAA Storm Database and answer the following questions about severe weather events:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This report can help public managers who are responsible for preparing for severe weather events and need to prioritize resources across different types of events.
The data for this analysis come as a comma-separated-value file compressed with the bzip2 algorithm. The file is available from the web and its size is 47 MB.
Documentation for the database is available from the National Weather Service Storm Data Documentation web site, and there is also a FAQ. The events in the file start in the year 1950 and end in November 2011. In the earlier years there are generally fewer events recorded. More recent years should be considered more complete.
Reading the data required the packages R.utils, lubridate, ggplot2, plyr and pander, which are not part of the default R installation. They were loaded first:
## packages required
require(R.utils)
## Loading required package: R.utils
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
require(lubridate)
## Loading required package: lubridate
require(ggplot2)
## Loading required package: ggplot2
require(plyr)
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
## here
require(pander)
## Loading required package: pander
##
## Attaching package: 'pander'
##
## The following object is masked from 'package:R.utils':
##
## wrap
After that, the file was downloaded and its data stored in the data frame “noaaStormData”:
## reading file
urlDataFile <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
compDataFile <- "repdata-data-StormData.csv.bz2"
dataFile <- "repdata-data-StormData.csv"
if (!file.exists(dataFile)) {
    download.file(urlDataFile,
                  destfile = compDataFile)
    bunzip2(compDataFile,
            destname = dataFile,
            overwrite = TRUE, remove = FALSE)
}
noaaStormData <- read.csv(dataFile)
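As an aside on the reading step, base R can usually read a bzip2-compressed CSV directly, which would make the explicit bunzip2 call optional. The following is a minimal sketch on an invented toy data frame and a temporary file, not on the NOAA file itself; the names toy, tmpFile and toyRead are assumptions made for illustration.

```r
## Sketch: read a bzip2-compressed CSV without decompressing it first.
## The toy data frame and temporary file are placeholders, not the NOAA file.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))

## write the toy data through a bzip2 connection
tmpFile <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmpFile, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

## read.csv() opens the path with file(), which detects bzip2 compression
toyRead <- read.csv(tmpFile, stringsAsFactors = FALSE)
print(toyRead)
```

Whether this is preferable for the full file depends on how often the data are re-read, since decompression is then repeated on every read.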
Here are the first records of the data frame “noaaStormData”:
## First database records
head(noaaStormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
An inspection of the EVTYPE variable showed a large number of distinct event types:
## Number of event types
length(unique(noaaStormData$EVTYPE))
## [1] 985
Many of these values are variants of the same event, so several transformations were needed to make the data tidy. For example, the string “COAST” appears in the following variants (upper case, lower case, extra spaces, etc.):
## Example of several types of values for "COAST"
sort(unique(grep('COAST', noaaStormData$EVTYPE, ignore.case = TRUE, value = TRUE)))
## [1] " COASTAL FLOOD" "BEACH EROSION/COASTAL FLOOD"
## [3] "COASTAL FLOODING/EROSION" "COASTAL EROSION"
## [5] "Coastal Flood" "COASTAL FLOOD"
## [7] "coastal flooding" "Coastal Flooding"
## [9] "COASTAL FLOODING" "COASTAL FLOODING/EROSION"
## [11] "Coastal Storm" "COASTAL STORM"
## [13] "COASTAL SURGE" "COASTAL/TIDAL FLOOD"
## [15] "COASTALFLOOD" "COASTALSTORM"
## [17] "HEAVY SURF COASTAL FLOODING" "HIGH WINDS/COASTAL FLOOD"
Therefore, the following transformations were applied to the EVTYPE variable:
## remove leading blanks, replace blanks by a single space and convert strings to upper case
noaaStormData$EVTYPE <- gsub("^ *", "",
toupper(as.character(noaaStormData$EVTYPE)))
noaaStormData$EVTYPE <- gsub("* *", " ",
toupper(as.character(noaaStormData$EVTYPE)))
## collapse event types into broader categories
noaaStormData$EVTYPE <- gsub("(^+|.+|)FLOOD(|+.+$)", "FLOOD",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)HEAT(|+.+$)", "HEAT",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)WARM(|+.+$)", "HEAT",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)HOT(|+.+$)", "HEAT",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)SNOW(|+.+$)", "SNOW",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)ICE(|+.+$)", "SNOW",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)WIND(|+.+$)", "WIND",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)HAIL(|+.+$)", "HAIL",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)RAIN(|+.+$)", "RAIN",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)HURRICANE(|+.+$)", "HURRICANE",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)TORNADO(|+.+$)", "TORNADO",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)THUNDERSTORM(|+.+$)", "THUNDERSTORM",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)STORM(|+.+$)", "THUNDERSTORM",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)WATERSPOUT(|+.+$)", "WATERSPOUT",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)SUMMARY(|+.+$)", "SUMMARY",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)DRY(|+.+$)", "DRY",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)CURRENT(|+.+$)", "RIP CURRENT",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)LIGHTNING(|+.+$)", "LIGHTNING",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)COLD(|+.+$)", "COLD",noaaStormData$EVTYPE)
noaaStormData$EVTYPE <- gsub("(^+|.+|)FREEZ(|+.+$)", "COLD",noaaStormData$EVTYPE)
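The grouping idea in the transformations above can also be sketched with a small lookup of substring patterns. This is an illustrative alternative on invented labels, not the code used for the results in this report; rawTypes, cleanTypes and categoryMap are hypothetical names, and the mapping covers only a few of the categories.

```r
## Sketch: normalise event-type labels, then map them to broad categories.
## `rawTypes` and `categoryMap` are illustrative, not taken from the report.
rawTypes <- c("   COASTAL FLOOD", "Coastal  Flooding", "TSTM WIND",
              "Excessive Heat", "ICE STORM", "FOG")

## trim both ends, collapse runs of spaces, convert to upper case
cleanTypes <- toupper(gsub(" +", " ", trimws(rawTypes)))

## substring pattern -> category; rules are applied in order, and a label is
## replaced by its category name as soon as one pattern matches it
categoryMap <- c(FLOOD = "FLOOD", HEAT = "HEAT", ICE = "SNOW",
                 WIND = "WIND", STORM = "THUNDERSTORM")

for (pattern in names(categoryMap)) {
  hit <- grepl(pattern, cleanTypes, fixed = TRUE)
  cleanTypes[hit] <- categoryMap[[pattern]]
}
print(cleanTypes)
```

Because the match is on substrings, rule order matters: "ICE STORM" is assigned to SNOW here because the ICE rule runs before the STORM rule. Short patterns such as ICE can also match inside unrelated words, so the resulting groups are worth inspecting.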
This reduced the number of event types:
## number of event types
length(unique(noaaStormData$EVTYPE))
## [1] 191
As noted above, the earlier years of the database generally contain fewer recorded events, and more recent years should be considered more complete. It was therefore necessary to examine the number of occurrences by year.
For this, the variable “year” was created:
## creating the variable "year"
noaaStormData$year <- year(strptime(noaaStormData$BGN_DATE, "%m/%d/%Y"))
This plot shows the number of occurrences by year in the database:
## creating plot to show the number of occurrences by year in the database
subYearAnalysis <- count(noaaStormData, vars = "year")
ggplot(subYearAnalysis, aes(year, freq) ) + geom_line() + ggtitle("NOAA Storm Database Occurrences by Year") + geom_point(data=subYearAnalysis[subYearAnalysis$year==1995,],size=5,colour="red")
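The year extraction and the per-year count can also be sketched in base R alone. The example below uses three date strings in the BGN_DATE format from the first records shown earlier; bgnDate, eventYear and yearCounts are names chosen for illustration.

```r
## Sketch: extract the year from BGN_DATE-style strings and count records per year.
bgnDate <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

## as.Date() stops parsing once the format is matched, so the time part is ignored
eventYear <- as.integer(format(as.Date(bgnDate, format = "%m/%d/%Y"), "%Y"))

## one row per year with its number of records (columns: year, Freq)
yearCounts <- as.data.frame(table(year = eventYear), stringsAsFactors = FALSE)
print(yearCounts)
```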
To obtain more uniform results, we chose to consider only the data from 1995 onward, by which point the number of recorded occurrences per year had increased substantially, as shown in the following table:
## table with the frequency of occurrences from 1993 to 1998
subYearAnalysisTemp <- subYearAnalysis[subYearAnalysis$year > 1992,]
panderOptions('table.split.table', Inf)
set.caption('NOAA Storm Database frequency of occurrences from 1993 to 1998')
pander(head(subYearAnalysisTemp))
##
## ----------------------
## year freq
## -------- ------ ------
## **44** 1993 12607
##
## **45** 1994 20631
##
## **46** 1995 27970
##
## **47** 1996 32270
##
## **48** 1997 28680
##
## **49** 1998 38128
## ----------------------
##
## Table: NOAA Storm Database frequency of occurrences from 1993 to 1998
A subset of the database was then generated containing only the records from 1995 onward, which is the basis for the analyses presented in this report. The subset also keeps only the columns needed to answer the questions addressed here: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP and the year of occurrence.
## creating new subset with main variables
noaaStormDataFinal <- noaaStormData[noaaStormData$year > 1994, c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "year" )]
Even after filtering by year and keeping only the necessary variables, some further changes are needed before the economic consequences can be assessed.
The types of events with the greatest economic consequences can be calculated from the columns for economic damage to crops (CROPDMG) and to property (PROPDMG). The variables CROPDMGEXP and PROPDMGEXP hold an alphabetic character giving the magnitude of CROPDMG and PROPDMG, respectively: “K” for thousands, “M” for millions, and “B” for billions. It is therefore necessary to create a new variable (economicDamage) that sums CROPDMG and PROPDMG after scaling each by the magnitude given in CROPDMGEXP and PROPDMGEXP.
Before that, the values of CROPDMGEXP and PROPDMGEXP need to be normalized, because they are not uniform, as shown below:
# Values for CROPDMGEXP
unique(noaaStormDataFinal$CROPDMGEXP)
## [1] M m K B ? 0 k 2
## Levels: ? 0 2 B k K m M
# Values for PROPDMGEXP
unique(noaaStormDataFinal$PROPDMGEXP)
## [1] B M K m + 0 5 6 ? 4 2 3 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Treating the values of CROPDMGEXP and PROPDMGEXP as powers of ten applied to CROPDMG and PROPDMG, the following changes were made to convert them to numeric values wherever possible.
The following changes were applied to the PROPDMGEXP variable:
## Transform PROPDMGEXP variable
## Transform h or H into 2
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[hH]","2",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform k or K into 3
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[kK]","3",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform m or M into 6
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[mM]","6",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform b or B into 9
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[bB]","9",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform - or + or ? into 0
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[-\\+\\?]","0",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform empty strings to zero
noaaStormDataFinal$PROPDMGEXP[!nzchar(noaaStormDataFinal$PROPDMGEXP)] <- "0"
The following changes were applied to the CROPDMGEXP variable:
## Transform CROPDMGEXP variable
# Transform h or H into 2
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[hH]","2",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform k or K into 3
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[kK]","3",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform m or M into 6
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[mM]","6",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform b or B into 9
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[bB]","9",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform - or + or ? into 0
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[-\\+\\?]","0",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform empty strings to zero
noaaStormDataFinal$CROPDMGEXP[!nzchar(noaaStormDataFinal$CROPDMGEXP)] <- "0"
Thus, the variable economicDamage was obtained with the following formula: economicDamage = (CROPDMG × 10^CROPDMGEXP) + (PROPDMG × 10^PROPDMGEXP)
## Adding the economicDamage variable to the subset
noaaStormDataFinal$economicDamage<- noaaStormDataFinal$CROPDMG * (10^as.numeric(noaaStormDataFinal$CROPDMGEXP))+
noaaStormDataFinal$PROPDMG * (10^as.numeric(noaaStormDataFinal$PROPDMGEXP))
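The conversion rules described above (H, K, M, B to powers of ten, digits used as exponents, everything else treated as 10^0) can be sketched as a single helper with a lookup table. The names expToPower and damageValue are hypothetical and do not appear in the analysis code; the sketch only restates the same rules in compact form.

```r
## Sketch: convert damage magnitude codes to multipliers with a lookup table.
## `expToPower` and `damageValue` are hypothetical helpers, not from the report.
expToPower <- c(H = 2, K = 3, M = 6, B = 9)

damageValue <- function(amount, code) {
  code <- toupper(as.character(code))
  power <- unname(expToPower[code])            # NA for codes not in the table
  isDigit <- grepl("^[0-9]$", code)
  power[isDigit] <- as.numeric(code[isDigit])  # single digits are used as exponents
  power[is.na(power)] <- 0                     # "", "-", "+", "?" -> 10^0
  amount * 10^power
}

## worked example: 0.1 with code B plus 10 with code M
total <- damageValue(0.1, "B") + damageValue(10, "M")
print(total)
```

With these inputs the sum is 0.1 × 10^9 + 10 × 10^6 = 1.1 × 10^8.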
The first records of the new subset are the following:
## The first records of noaaStormDataFinal dataset
head(noaaStormDataFinal)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 187560 RAIN 0 0 0.0 0 0 0
## 187561 SNOW 0 0 0.0 0 0 0
## 187563 SNOW 0 0 0.0 0 0 0
## 187565 SNOW 0 0 0.0 0 0 0
## 187566 WIND 2 0 0.1 9 10 6
## 187575 HAIL 0 0 0.0 0 0 0
## year economicDamage
## 187560 1995 0.0e+00
## 187561 1995 0.0e+00
## 187563 1995 0.0e+00
## 187565 1995 0.0e+00
## 187566 1995 1.1e+08
## 187575 1995 0.0e+00
To show which types of events are most harmful to population health, this report considers two aspects: the number of fatalities and the number of injuries.
Summarizing the number of fatalities by event type gives the following graph:
fatalities = aggregate(FATALITIES ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
fatalities = fatalities[order(fatalities$FATALITIES, decreasing = T), ]
fatalities$EVTYPE <- factor(fatalities$EVTYPE, levels = fatalities$EVTYPE[order(fatalities$FATALITIES)])
p <- ggplot(fatalities[1:10,], aes(EVTYPE, FATALITIES, fill=EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("Amount of Fatalities") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Fatalities Across the U.S (Top 10)")
p + theme(axis.text.x = element_text(angle=90, vjust=0.5))
#### Injuries
Summarizing the number of injuries by event type gives the following graph:
injuries = aggregate(INJURIES ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
injuries = injuries[order(injuries$INJURIES, decreasing = T), ]
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE[order(injuries$INJURIES)])
p <- ggplot(injuries[1:10,], aes(EVTYPE, INJURIES, fill=EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("Amount of Injuries") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Injuries Across the U.S (Top 10)")
p + theme(axis.text.x = element_text(angle=90, vjust=0.5))
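The fatalities and injuries summaries follow the same steps: sum a measure per event type, sort in decreasing order, and keep the first rows. A minimal sketch of that pattern as one reusable function is given below; topEvents is a hypothetical helper and the toy data are invented, so the numbers say nothing about the NOAA results.

```r
## Sketch: total a measure per event type and keep the top n rows.
## `topEvents` is a hypothetical helper; the toy data are invented.
topEvents <- function(data, measure, n = 10) {
  totals <- aggregate(data[[measure]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- measure                      # aggregate() names the column "x"
  totals <- totals[order(totals[[measure]], decreasing = TRUE), ]
  head(totals, n)
}

toy <- data.frame(EVTYPE = c("HEAT", "TORNADO", "HEAT", "FLOOD"),
                  FATALITIES = c(5, 2, 1, 3))
topFatal <- topEvents(toy, "FATALITIES", n = 2)
print(topFatal)
```

The returned data frame can be passed to ggplot in the same way as the top-10 subsets used in the plots.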
In both cases (fatalities and injuries), the types of events most harmful with respect to population health are lightning, wind, flood, heat and tornado. Tornadoes caused the most injuries and heat the most fatalities.
To show which types of events have the greatest economic impact, this report analyzes the combined economic damage to crops and to property in US dollars. This amount was stored in the economicDamage variable, taking into account the magnitude of the values, as described in the “Data Cleaning Process - Damage variables” section.
Summarizing the variable economicDamage by event type gives the following chart:
ecoDamage = aggregate(economicDamage ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
ecoDamage = ecoDamage[order(ecoDamage$economicDamage, decreasing = T), ]
ecoDamage$EVTYPE <- factor(ecoDamage$EVTYPE, levels = ecoDamage$EVTYPE[order(ecoDamage$economicDamage)])
p <- ggplot(ecoDamage[1:10,], aes(EVTYPE, economicDamage, fill=EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("US$") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Economic Damages - U.S. (Top 10)")
p + theme(axis.text.x = element_text(angle=90, vjust=0.5))
In conclusion, the types of events with the greatest economic impact are heat, tornado and flood. Heat is the main type of event that causes economic damage.