Synopsis

Severe weather events can cause both public health and economic problems for communities and municipalities. Many of them can result in fatalities, injuries, and property damage. Preventing such outcomes to the extent possible is a key concern.

The NOAA Storm Database (http://www.ncdc.noaa.gov/stormevents/) tracks characteristics of major storms and weather events in the United States. It records when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

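The goal of this analysis is to explore the NOAA Storm Database and answer the following questions about severe weather events:

  • Which types of events are most harmful with respect to population health?
  • Which types of events have the greatest economic consequences?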

This report can help public managers who are responsible for preparing for severe weather events and who need to prioritize resources for different types of events.

Data Processing

Load Data Process

The data for this analysis come in the form of a comma-separated-value file compressed with the bzip2 algorithm. The file is available from the web and its size is 47 MB.

The documentation of the database is available from the National Weather Service Storm Data Documentation web site, and there is a FAQ too. The events in the file start in the year 1950 and end in November 2011. In the earlier years there are generally fewer events recorded. More recent years should be considered more complete.

Reading and processing the data required the packages R.utils, lubridate, ggplot2, plyr and pander, which are not part of the default R installation, so they had to be loaded first:

## packages required
require(R.utils)
## Loading required package: R.utils
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
require(lubridate)
## Loading required package: lubridate
require(ggplot2)
## Loading required package: ggplot2
require(plyr)
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:lubridate':
## 
##     here
require(pander)
## Loading required package: pander
## 
## Attaching package: 'pander'
## 
## The following object is masked from 'package:R.utils':
## 
##     wrap

After that, the file was downloaded and its data stored in the data frame “noaaStormData”:

## download and decompress the file (only if not already present), then read it
urlDataFile <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
compDataFile <- "repdata-data-StormData.csv.bz2"
dataFile <- "repdata-data-StormData.csv"
if (!file.exists(dataFile)) {
    ## mode = "wb" keeps the compressed file intact on Windows
    download.file(urlDataFile, destfile = compDataFile, mode = "wb")
    bunzip2(compDataFile, destname = dataFile,
            overwrite = TRUE, remove = FALSE)
}
noaaStormData <- read.csv(dataFile)
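
As a side note, `read.csv` can read a bzip2-compressed file directly, because R's file connections detect the compression when a file is opened for reading. This would make the separate `bunzip2` step optional. The sketch below illustrates the idea on a small temporary file; the `toy` data frame and `tmpFile` path are made up for illustration and are not part of the NOAA data.

```r
## Toy data standing in for the storm file (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(0, 1))

## Write the toy data to a bzip2-compressed CSV in a temporary location
tmpFile <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmpFile, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

## read.csv() decompresses on the fly; no bunzip2() call is needed
toyBack <- read.csv(tmpFile, stringsAsFactors = FALSE)
print(toyBack)
```

Reading the compressed file directly saves disk space, at the cost of decompressing it again on every read.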

Here are the first records of the data frame “noaaStormData”:

## First database records
head(noaaStormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Data Cleaning Process

Event Type variable

Inspecting the values of the EVTYPE variable showed a large number of distinct event types:

## Number of event types
length(unique(noaaStormData$EVTYPE))
## [1] 985

Several transformations were therefore necessary to make the data tidy. For example, the string “COAST” appears in all of the following values, which differ in letter case, extra spaces, and so on:

## Example of several types of values for "COAST" 
sort(unique(grep('COAST', noaaStormData$EVTYPE, ignore.case = TRUE, value = TRUE)))
##  [1] " COASTAL FLOOD"              "BEACH EROSION/COASTAL FLOOD"
##  [3] "COASTAL  FLOODING/EROSION"   "COASTAL EROSION"            
##  [5] "Coastal Flood"               "COASTAL FLOOD"              
##  [7] "coastal flooding"            "Coastal Flooding"           
##  [9] "COASTAL FLOODING"            "COASTAL FLOODING/EROSION"   
## [11] "Coastal Storm"               "COASTAL STORM"              
## [13] "COASTAL SURGE"               "COASTAL/TIDAL FLOOD"        
## [15] "COASTALFLOOD"                "COASTALSTORM"               
## [17] "HEAVY SURF COASTAL FLOODING" "HIGH WINDS/COASTAL FLOOD"

Therefore, the following transformations were applied to the variable EVTYPE:

  • removing unnecessary spaces
  • converting strings to uppercase to avoid duplicate spellings of the same value
  • grouping related values to reduce the number of event types
## strip leading spaces, collapse repeated spaces and convert to upper case
noaaStormData$EVTYPE <- gsub("^ +", "",
                             toupper(as.character(noaaStormData$EVTYPE)))
noaaStormData$EVTYPE <- gsub(" +", " ", noaaStormData$EVTYPE)
## collapse related event types into broader categories; the rules are applied
## in order, and any value containing the keyword is replaced by the category
evtypeRules <- c(FLOOD = "FLOOD", HEAT = "HEAT", WARM = "HEAT", HOT = "HEAT",
                 SNOW = "SNOW", ICE = "SNOW", WIND = "WIND", HAIL = "HAIL",
                 RAIN = "RAIN", HURRICANE = "HURRICANE", TORNADO = "TORNADO",
                 THUNDERSTORM = "THUNDERSTORM", STORM = "THUNDERSTORM",
                 WATERSPOUT = "WATERSPOUT", SUMMARY = "SUMMARY", DRY = "DRY",
                 CURRENT = "RIP CURRENT", LIGHTNING = "LIGHTNING",
                 COLD = "COLD", FREEZ = "COLD")
for (keyword in names(evtypeRules)) {
    noaaStormData$EVTYPE <- gsub(paste0(".*", keyword, ".*"),
                                 evtypeRules[[keyword]], noaaStormData$EVTYPE)
}
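
To make the effect of these transformations concrete, here is a minimal sketch applied to a handful of the “COAST” values listed earlier. The simplified patterns `.*FLOOD.*` and `.*WIND.*` are an assumption about what the expressions above are meant to match (any value containing the keyword), and the variable names `rawTypes` and `cleanTypes` are illustrative only.

```r
## A few raw labels taken from the "COAST" listing above
rawTypes <- c(" COASTAL FLOOD", "Coastal Flooding", "COASTAL  FLOODING/EROSION",
              "COASTALFLOOD", "HIGH WINDS/COASTAL FLOOD")

## Convert to upper case, trim leading blanks, and collapse repeated blanks
cleanTypes <- toupper(rawTypes)
cleanTypes <- gsub("^ +", "", cleanTypes)
cleanTypes <- gsub(" +", " ", cleanTypes)

## Collapse every label containing a keyword into one category.
## Order matters: "HIGH WINDS/COASTAL FLOOD" becomes FLOOD because the
## FLOOD rule runs before the WIND rule.
cleanTypes <- gsub(".*FLOOD.*", "FLOOD", cleanTypes)
cleanTypes <- gsub(".*WIND.*", "WIND", cleanTypes)

print(unique(cleanTypes))
## [1] "FLOOD"
```

Because each rule replaces the whole value, an event that mentions two keywords is assigned to whichever rule is applied first, which is worth keeping in mind when reading the results.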

As a result, the number of event types decreased:

## number of event types
length(unique(noaaStormData$EVTYPE))
## [1] 191

Date variable

The documentation notes that the earlier years of the database generally contain fewer recorded events and that more recent years should be considered more complete. It was therefore necessary to analyze the number of occurrences by year.

To do this, the variable “year” was created:

## creating the variable "year"
noaaStormData$year <- year(strptime(noaaStormData$BGN_DATE, "%m/%d/%Y"))
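
For readers without lubridate, the same extraction can be sketched in base R. The example below uses a few BGN_DATE strings copied from the first records shown earlier; the names `bgnDate`, `parsed` and `yearBase` are illustrative only.

```r
## Sample BGN_DATE strings copied from the first records shown earlier
bgnDate <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

## Parse the date part; the trailing " 0:00:00" is ignored by the format
parsed <- as.Date(bgnDate, format = "%m/%d/%Y")

## Extract the calendar year as an integer
yearBase <- as.integer(format(parsed, "%Y"))
print(yearBase)
## [1] 1950 1951 1951
```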

This plot shows the number of occurrences by year in the database:

## plot the number of occurrences by year, highlighting 1995 in red
subYearAnalysis <- count(noaaStormData, vars = "year")
ggplot(subYearAnalysis, aes(year, freq)) +
    geom_line() +
    ggtitle("NOAA Storm Database Occurrences by Year") +
    geom_point(data = subYearAnalysis[subYearAnalysis$year == 1995, ],
               size = 5, colour = "red")

[Figure: NOAA Storm Database Occurrences by Year (line plot, 1995 highlighted in red)]

To obtain more uniform results, we chose to consider only the data from 1995 onward, a year around which the number of recorded occurrences increases considerably, as the following table shows:

## table with the frequency of occurrences from 1993 to 1998
subYearAnalysisTemp <- subYearAnalysis[subYearAnalysis$year > 1992,]                    
panderOptions('table.split.table', Inf) 
set.caption('NOAA Storm Database frequency of occurrences from 1993 to 1998')            
pander(head(subYearAnalysisTemp))
## 
## ----------------------
##  &nbsp;   year   freq 
## -------- ------ ------
##  **44**   1993  12607 
## 
##  **45**   1994  20631 
## 
##  **46**   1995  27970 
## 
##  **47**   1996  32270 
## 
##  **48**   1997  28680 
## 
##  **49**   1998  38128 
## ----------------------
## 
## Table: NOAA Storm Database frequency of occurrences from 1993 to 1998

A subset of the database containing only the records from 1995 onward was then generated for the analyses presented in this report. Furthermore, the subset keeps only the columns needed to answer the questions addressed in this report: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP and the year in which the event occurred.

## creating new subset with main variables
noaaStormDataFinal <- noaaStormData[noaaStormData$year > 1994,
                                    c("EVTYPE", "FATALITIES", "INJURIES",
                                      "PROPDMG", "PROPDMGEXP",
                                      "CROPDMG", "CROPDMGEXP", "year")]

Damage variables

Starting from the subset filtered by year and reduced to the variables needed for the analysis, some further changes are required to allow the economic consequences to be assessed.

The event types with the greatest economic consequences can be calculated from the columns for economic damage to crops (CROPDMG) and to property (PROPDMG). The variables CROPDMGEXP and PROPDMGEXP hold an alphabetic character signifying the magnitude of CROPDMG and PROPDMG, respectively: “K” for thousands, “M” for millions, and “B” for billions. It is therefore necessary to create a new variable (economicDamage) that sums CROPDMG and PROPDMG after scaling each value by the magnitude given in CROPDMGEXP and PROPDMGEXP.

Before that, the values of CROPDMGEXP and PROPDMGEXP must be normalized, because they are not uniform, as shown below:

# Values for CROPDMGEXP
unique(noaaStormDataFinal$CROPDMGEXP)
## [1]   M m K B ? 0 k 2
## Levels:  ? 0 2 B k K m M
# Values for PROPDMGEXP
unique(noaaStormDataFinal$PROPDMGEXP)
##  [1]   B M K m + 0 5 6 ? 4 2 3 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

Treating the values of CROPDMGEXP and PROPDMGEXP as the order of magnitude (power of ten) of CROPDMG and PROPDMG, it was necessary to convert them to numeric exponents wherever possible.

The following changes were applied to the PROPDMGEXP variable:

## Transform PROPDMGEXP variable
## Transform h or H into 2
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[hH]","2",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform k or K into 3
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[kK]","3",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform m or M into 6
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[mM]","6",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform b or B into 9
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[bB]","9",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform - or + or ? into 0
noaaStormDataFinal[,"PROPDMGEXP"] <- gsub("[-\\+\\?]","0",noaaStormDataFinal[,"PROPDMGEXP"])
## Transform empty strings to zero
noaaStormDataFinal$PROPDMGEXP[!nzchar(noaaStormDataFinal$PROPDMGEXP)] <- "0"

The following changes were applied to the CROPDMGEXP variable:

## Transform CROPDMGEXP variable
# Transform h or H into 2
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[hH]","2",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform k or K into 3
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[kK]","3",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform m or M into 6
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[mM]","6",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform b or B into 9
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[bB]","9",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform - or + or ? into 0
noaaStormDataFinal[,"CROPDMGEXP"] <- gsub("[-\\+\\?]","0",noaaStormDataFinal[,"CROPDMGEXP"])
# Transform empty strings to zero
noaaStormDataFinal$CROPDMGEXP[!nzchar(noaaStormDataFinal$CROPDMGEXP)] <- "0"
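
The same mapping can also be written once as a lookup and reused for both columns. The following is only a sketch under that assumption: `expToPower` is a hypothetical helper, not part of the original analysis, and it follows the same rules as the code above (letters to powers of ten, symbols and empty strings to zero, digits kept as they are).

```r
## Hypothetical helper: convert a damage exponent code to a power of ten
expToPower <- function(code) {
    code <- toupper(as.character(code))
    ## Letter codes and the symbols treated as "no multiplier"
    lookup <- c(H = 2, K = 3, M = 6, B = 9, "-" = 0, "+" = 0, "?" = 0)
    power <- unname(lookup[code])
    ## Digit codes are used as the exponent itself
    isDigit <- grepl("^[0-9]$", code)
    power[isDigit] <- as.numeric(code[isDigit])
    ## Empty strings mean no multiplier
    power[!nzchar(code)] <- 0
    power
}

print(expToPower(c("K", "m", "B", "", "?", "5", "h")))
## [1] 3 6 9 0 0 5 2
```

A single helper keeps the two columns consistent and avoids repeating each substitution twice.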

Thus, the variable economicDamage was obtained by the following formula: economicDamage = (CROPDMG × 10^CROPDMGEXP) + (PROPDMG × 10^PROPDMGEXP)

## Adding the economicDamage variable to the subset
noaaStormDataFinal$economicDamage <-
    noaaStormDataFinal$CROPDMG * (10^as.numeric(noaaStormDataFinal$CROPDMGEXP)) +
    noaaStormDataFinal$PROPDMG * (10^as.numeric(noaaStormDataFinal$PROPDMGEXP))

The first records of the new subset are the following:

## The first records of noaaStormDataFinal dataset
head(noaaStormDataFinal)
##        EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 187560   RAIN          0        0     0.0          0       0          0
## 187561   SNOW          0        0     0.0          0       0          0
## 187563   SNOW          0        0     0.0          0       0          0
## 187565   SNOW          0        0     0.0          0       0          0
## 187566   WIND          2        0     0.1          9      10          6
## 187575   HAIL          0        0     0.0          0       0          0
##        year economicDamage
## 187560 1995        0.0e+00
## 187561 1995        0.0e+00
## 187563 1995        0.0e+00
## 187565 1995        0.0e+00
## 187566 1995        1.1e+08
## 187575 1995        0.0e+00
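
As a quick sanity check, the economicDamage value of row 187566 in the output above can be recomputed by hand from its four damage columns. The variable names below are illustrative only.

```r
## Values copied from row 187566 of the output above
cropDmg <- 10;  cropExp <- 6   # 10 x 10^6: 10 million in crop damage
propDmg <- 0.1; propExp <- 9   # 0.1 x 10^9: 100 million in property damage

## Same formula as used for the economicDamage variable
economicDamage <- cropDmg * 10^cropExp + propDmg * 10^propExp
print(economicDamage)
```

The result is 110 million US dollars, which agrees with the 1.1e+08 shown for that row.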

Results

Types of events most harmful to population health

To identify the event types most harmful to population health, this report examines two measures: the number of fatalities and the number of injuries.

Fatalities

Summing the fatalities by event type gives the following graph:

## total fatalities per event type, sorted from highest to lowest
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
fatalities <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
## order the factor levels by total so the bars appear sorted in the plot
fatalities$EVTYPE <- factor(fatalities$EVTYPE, levels = fatalities$EVTYPE[order(fatalities$FATALITIES)])
## bar chart of the ten event types with the most fatalities
p <- ggplot(fatalities[1:10, ], aes(EVTYPE, FATALITIES, fill = EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("Amount of Fatalities") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Fatalities Across the U.S (Top 10)")
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

[Figure: Types of Events Causing Fatalities Across the U.S (Top 10), bar chart of fatalities by event type]

Injuries

Summing the injuries by event type gives the following graph:

## total injuries per event type, sorted from highest to lowest
injuries <- aggregate(INJURIES ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
injuries <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
## order the factor levels by total so the bars appear sorted in the plot
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE[order(injuries$INJURIES)])
## bar chart of the ten event types with the most injuries
p <- ggplot(injuries[1:10, ], aes(EVTYPE, INJURIES, fill = EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("Amount of Injuries") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Injuries Across the U.S (Top 10)")
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

[Figure: Types of Events Causing Injuries Across the U.S (Top 10), bar chart of injuries by event type]
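
Both charts above rely on the same pattern: sum a measure per event type, sort the totals, and keep the largest ones. The sketch below shows that pattern on a toy data frame; the numbers are made up purely for illustration and do not come from the NOAA data.

```r
## Toy data (made-up numbers) standing in for noaaStormDataFinal
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "HAIL"),
    FATALITIES = c(3, 5, 2, 1, 4, 0),
    stringsAsFactors = FALSE
)

## Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

## Sort from most to least harmful and keep the top three
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
topThree <- head(totals, 3)
print(topThree)
```

In this toy example HEAT (9) ranks first, followed by TORNADO (5) and FLOOD (1). Using `head()` rather than indexing with `1:10` also behaves sensibly when fewer rows than requested are available.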

Conclusion about the most harmful types of events

In both cases (fatalities and injuries), the event types most harmful to population health are lightning, wind, flood, heat and tornado. Tornadoes caused more injuries, whereas heat caused more fatalities.

Types of events with the greatest economic consequences

To identify the event types with the greatest economic impact, this report examines the combined economic damage to crops and to property, in US dollars. This amount was stored in the economicDamage variable, taking the magnitude of the values into account, as described in the “Data Cleaning Process - Damage variables” section.

Summing the variable economicDamage by event type gives the following chart:

## total economic damage per event type, sorted from highest to lowest
ecoDamage <- aggregate(economicDamage ~ EVTYPE, data = noaaStormDataFinal, FUN = sum)
ecoDamage <- ecoDamage[order(ecoDamage$economicDamage, decreasing = TRUE), ]
## order the factor levels by total so the bars appear sorted in the plot
ecoDamage$EVTYPE <- factor(ecoDamage$EVTYPE, levels = ecoDamage$EVTYPE[order(ecoDamage$economicDamage)])
## bar chart of the ten event types with the greatest economic damage
p <- ggplot(ecoDamage[1:10, ], aes(EVTYPE, economicDamage, fill = EVTYPE))
p <- p + geom_bar(stat = "identity") + ylab("US$") + xlab("Event Type")
p <- p + ggtitle("Types of Events Causing Economic Damages - U.S. (Top 10)")
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

[Figure: Types of Events Causing Economic Damages - U.S. (Top 10), bar chart of economic damage in US$ by event type]

In conclusion, the types of events with the greatest economic impact are heat, tornado and flood. Heat is the main type of event that causes economic damage.