Synopsis

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The goal of this project is to identify which types of events are most harmful with respect to population health, and which ones have the greatest economic consequences across the United States. In order to accomplish that, the NOAA data was downloaded, cleaned, transformed, and analyzed to identify the types of events that produce the highest number of deaths and injuries, and respectively property and crop damage, across the entire country. The results of the analysis indicate that regarding population health, in the better recorded recent history (since 1993) heat waves are responsible for the highest number of deaths, while tornadoes account for the highest number of injuries. Regarding economic consequences, major storms produce the biggest property damage, while drought, floods, and harsh winter conditions produce most of the crop damages.

Data Processing

# The pander package is used for nicer formatting of tables and printed results
library(pander)
library(tidyr)
library(ggplot2)
# For reproducibility purpose, display the date and time of data download and report generation
# Cache option has been disabled in this code chunk in order for knitr to print the current time
pander(date())

Sun May 24 15:58:48 2015

Retrieving and loading data

The analysis starts with the downloading of the raw data file from the web link provided in the project assignment. The data is read directly from the compressed .csv.bz2 file, as the read.csv command can read it without the need to uncompress it first.

# Load data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="StormData.csv.bz2", method="curl", quiet=TRUE)
data <- read.csv("StormData.csv.bz2", stringsAsFactors=FALSE)

Data cleaning and transformation

The raw data set contains multiple types of problems and errors which makes its direct analysis difficult. Examples of these problems and errors include, but are not limited to:

  • the lack of a controlled vocabulary for the description of the event type, which produced an explosion of similar but different terms used to encode similar type of events
pander(table(data$EVTYPE)[830:837])
Table continues below
TORNADO TORNADO DEBRIS TORNADOES TORNADOES, TSTM WIND, HAIL
60652 1 2 1
TORNADO F0 TORNADO F1 TORNADO F2 TORNADO F3
19 4 3 2

Table 1. Examples of different encodings of some TORNADO type events

pander(table(data$PROPDMGEXP))
Table continues below
  - ? + 0 1 2 3 4 5
465934 1 8 5 216 25 13 4 4 28
6 7 8 B h H K m M
4 5 1 40 1 6 424665 7 11330

Table 2. Number of each distinct values in the PROPDMGEXP field


pander(table(data$CROPDMGEXP))
  ? 0 2 B k K m M
618413 7 19 1 9 21 281832 1 1994

Table 3. Number of each distinct values in the CROPDMGEXP field

  • at the same time, at least one very significant error apears in the PROPDMGEXP field (“B” instead of “M”) that creates an excessive property damage report of $115 billion for a single event. That single event was part of a California flood where the total damage was reported to be about $300 million.
pander(subset(data,REFNUM==605943,select=c(BGN_DATE,COUNTYNAME,STATE,EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)))
Table continues below
  BGN_DATE COUNTYNAME STATE EVTYPE
605953 1/1/2006 0:00:00 NAPA CA FLOOD
  PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
605953 115 B 32.5 M

Table 4. Summary of the event containg an error in the PROPDMGEXP field (“B” instead of “M”). This mistake would make this event at 115 billion (instead of million) the most costly weather event in the recorded history in the US, and would introduce a major source of error in the results of the analysis

That is even more evident when looking at the REMARKS field for that record which shows that the damages are in the tens of millions and not billions.

pander(subset(data,REFNUM==605943,select=REMARKS))
  REMARKS
605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.

Table 5. Description of the above event showing that the actual damage was in the tens of million range


For those reasons I decided to create a cleaner dataset using the following methodology:

  1. Replace billions with millions for the above mentioned single event (Jan 1st, 2006 Napa Valley flood).
data[data$REFNUM==605943,]$PROPDMGEXP="M"
  1. For property and crop damage, consider only the values that have B, b, M, m, K, k, H, h in the PROPDMGEXP and CROPDMGEXP and ignore the rest that are either empty or have -, +, ?, or [0-9] numeric values.
# set the multiplication factor for property damage
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$PROPDMGFACTOR <- 0
data$PROPDMGFACTOR[data$PROPDMGEXP=="B"] <- 1000000000
data$PROPDMGFACTOR[data$PROPDMGEXP=="M"] <- 1000000
data$PROPDMGFACTOR[data$PROPDMGEXP=="K"] <- 1000
data$PROPDMGFACTOR[data$PROPDMGEXP=="H"] <- 100
# set the multiplication factor for property damage
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$CROPDMGFACTOR <- 0
data$CROPDMGFACTOR[data$CROPDMGEXP=="B"] <- 1000000000
data$CROPDMGFACTOR[data$CROPDMGEXP=="M"] <- 1000000
data$CROPDMGFACTOR[data$CROPDMGEXP=="K"] <- 1000
data$CROPDMGFACTOR[data$CROPDMGEXP=="H"] <- 100
# compute the property and crop damages
data$PROPDAMAGE <- data$PROPDMG * data$PROPDMGFACTOR
data$CROPDAMAGE <- data$CROPDMG * data$CROPDMGFACTOR
  1. Look at the most harmful events to identify keywords that can be used to group similar types of events.
harm <- aggregate(cbind(FATALITIES,INJURIES,PROPDAMAGE,CROPDAMAGE) ~ EVTYPE, data=data, sum)
harm$EVTYPE[order(harm$FATALITIES, na.last=TRUE, decreasing = TRUE)[1:40]]
##  [1] "TORNADO"                    "EXCESSIVE HEAT"            
##  [3] "FLASH FLOOD"                "HEAT"                      
##  [5] "LIGHTNING"                  "TSTM WIND"                 
##  [7] "FLOOD"                      "RIP CURRENT"               
##  [9] "HIGH WIND"                  "AVALANCHE"                 
## [11] "WINTER STORM"               "RIP CURRENTS"              
## [13] "HEAT WAVE"                  "EXTREME COLD"              
## [15] "THUNDERSTORM WIND"          "HEAVY SNOW"                
## [17] "EXTREME COLD/WIND CHILL"    "STRONG WIND"               
## [19] "BLIZZARD"                   "HIGH SURF"                 
## [21] "HEAVY RAIN"                 "EXTREME HEAT"              
## [23] "COLD/WIND CHILL"            "ICE STORM"                 
## [25] "WILDFIRE"                   "HURRICANE/TYPHOON"         
## [27] "THUNDERSTORM WINDS"         "FOG"                       
## [29] "HURRICANE"                  "TROPICAL STORM"            
## [31] "HEAVY SURF/HIGH SURF"       "LANDSLIDE"                 
## [33] "COLD"                       "HIGH WINDS"                
## [35] "TSUNAMI"                    "WINTER WEATHER"            
## [37] "UNSEASONABLY WARM AND DRY"  "URBAN/SML STREAM FLD"      
## [39] "WINTER WEATHER/MIX"         "TORNADOES, TSTM WIND, HAIL"
harm$EVTYPE[order(harm$INJURIES, na.last=TRUE, decreasing = TRUE)[1:40]]
##  [1] "TORNADO"              "TSTM WIND"            "FLOOD"               
##  [4] "EXCESSIVE HEAT"       "LIGHTNING"            "HEAT"                
##  [7] "ICE STORM"            "FLASH FLOOD"          "THUNDERSTORM WIND"   
## [10] "HAIL"                 "WINTER STORM"         "HURRICANE/TYPHOON"   
## [13] "HIGH WIND"            "HEAVY SNOW"           "WILDFIRE"            
## [16] "THUNDERSTORM WINDS"   "BLIZZARD"             "FOG"                 
## [19] "WILD/FOREST FIRE"     "DUST STORM"           "WINTER WEATHER"      
## [22] "DENSE FOG"            "TROPICAL STORM"       "HEAT WAVE"           
## [25] "HIGH WINDS"           "RIP CURRENTS"         "STRONG WIND"         
## [28] "HEAVY RAIN"           "RIP CURRENT"          "EXTREME COLD"        
## [31] "GLAZE"                "AVALANCHE"            "EXTREME HEAT"        
## [34] "HIGH SURF"            "WILD FIRES"           "ICE"                 
## [37] "TSUNAMI"              "TSTM WIND/HAIL"       "WIND"                
## [40] "URBAN/SML STREAM FLD"
harm$EVTYPE[order(harm$PROPDAMAGE, na.last=TRUE, decreasing = TRUE)[1:40]]
##  [1] "HURRICANE/TYPHOON"          "TORNADO"                   
##  [3] "STORM SURGE"                "FLOOD"                     
##  [5] "FLASH FLOOD"                "HAIL"                      
##  [7] "HURRICANE"                  "TROPICAL STORM"            
##  [9] "WINTER STORM"               "HIGH WIND"                 
## [11] "RIVER FLOOD"                "WILDFIRE"                  
## [13] "STORM SURGE/TIDE"           "TSTM WIND"                 
## [15] "ICE STORM"                  "THUNDERSTORM WIND"         
## [17] "HURRICANE OPAL"             "WILD/FOREST FIRE"          
## [19] "HEAVY RAIN/SEVERE WEATHER"  "THUNDERSTORM WINDS"        
## [21] "TORNADOES, TSTM WIND, HAIL" "SEVERE THUNDERSTORM"       
## [23] "DROUGHT"                    "HEAVY SNOW"                
## [25] "LIGHTNING"                  "HEAVY RAIN"                
## [27] "BLIZZARD"                   "WILD FIRES"                
## [29] "HIGH WINDS"                 "TYPHOON"                   
## [31] "LANDSLIDE"                  "FLASH FLOODING"            
## [33] "FLASH FLOOD/FLOOD"          "HURRICANE ERIN"            
## [35] "HAILSTORM"                  "COASTAL FLOOD"             
## [37] "STRONG WIND"                "FLOOD/FLASH FLOOD"         
## [39] "TSUNAMI"                    "COASTAL FLOODING"
harm$EVTYPE[order(harm$CROPDAMAGE, na.last=TRUE, decreasing = TRUE)[1:40]]
##  [1] "DROUGHT"                   "FLOOD"                    
##  [3] "RIVER FLOOD"               "ICE STORM"                
##  [5] "HAIL"                      "HURRICANE"                
##  [7] "HURRICANE/TYPHOON"         "FLASH FLOOD"              
##  [9] "EXTREME COLD"              "FROST/FREEZE"             
## [11] "HEAVY RAIN"                "TROPICAL STORM"           
## [13] "HIGH WIND"                 "TSTM WIND"                
## [15] "EXCESSIVE HEAT"            "FREEZE"                   
## [17] "TORNADO"                   "THUNDERSTORM WIND"        
## [19] "HEAT"                      "WILDFIRE"                 
## [21] "DAMAGING FREEZE"           "THUNDERSTORM WINDS"       
## [23] "EXCESSIVE WETNESS"         "HURRICANE ERIN"           
## [25] "HEAVY SNOW"                "FLOOD/RAIN/WINDS"         
## [27] "BLIZZARD"                  "WILD/FOREST FIRE"         
## [29] "FLOOD/FLASH FLOOD"         "COLD AND WET CONDITIONS"  
## [31] "FROST"                     "STRONG WIND"              
## [33] "TSTM WIND/HAIL"            "HEAVY RAINS"              
## [35] "Early Frost"               "HIGH WINDS"               
## [37] "Damaging Freeze"           "SEVERE THUNDERSTORM WINDS"
## [39] "AGRICULTURAL FREEZE"       "River Flooding"
  1. Create a new set of event categories, based on specific keywords extracted from the most harmful event types, and bundle everything else in the “Other” category. Please note that the kyewords appearing later in the list will override assignments of the previous ones. For example “WINTER STORM” will be categorized as “Winter Conditions” and not “Storm”, and “SNOW DROUGHT” will be in the “Drought” category and not “Winter Conditions”. Some of the initially chosen categories were later removed because of very low total impact on the final results, and their corresponding values were added to the “Other” category.
data$EVTYPE <- toupper(data$EVTYPE)
# put all events in the "Other" category by default
data$CATEG <- "Other"
# replace that with other categories based on specific keywords
#data$CATEG[grep("RAIN", data$EVTYPE)] <- "Rain"
#data$CATEG[grep("FOG", data$EVTYPE)] <- "Fog"
# data$CATEG[grep("FIRE", data$EVTYPE)] <- "Fire"
data$CATEG[grep("WIND", data$EVTYPE)] <- "Wind"
data$CATEG[grep("HEAT", data$EVTYPE)] <- "Heat"
data$CATEG[grep("FLOOD", data$EVTYPE)] <- "Flood"
data$CATEG[grep("HURRICANE", data$EVTYPE)] <- "Storm"
data$CATEG[grep("STORM", data$EVTYPE)] <- "Storm"
data$CATEG[grep("TSTM", data$EVTYPE)] <- "Storm"
data$CATEG[grep("TYPHOON", data$EVTYPE)] <- "Storm"
data$CATEG[grep("AVALANCHE", data$EVTYPE)] <- "Winter"
data$CATEG[grep("BLIZZARD", data$EVTYPE)] <- "Winter"
data$CATEG[grep("COLD", data$EVTYPE)] <- "Winter"
data$CATEG[grep("GLAZE", data$EVTYPE)] <- "Winter"
data$CATEG[grep("ICE", data$EVTYPE)] <- "Winter"
data$CATEG[grep("SNOW", data$EVTYPE)] <- "Winter"
data$CATEG[grep("WINTER", data$EVTYPE)] <- "Winter"
data$CATEG[grep("FROST", data$EVTYPE)] <- "Winter"
data$CATEG[grep("FREEZE", data$EVTYPE)] <- "Winter"
#data$CATEG[grep("RIP CURRENT", data$EVTYPE)] <- "Rip Current"
data$CATEG[grep("HAIL", data$EVTYPE)] <- "Hail"
data$CATEG[grep("LIGHTNING", data$EVTYPE)] <- "Lightning"
#data$CATEG[grep("SURF", data$EVTYPE)] <- "Surf"
data$CATEG[grep("TORNADO", data$EVTYPE)] <- "Tornado"
#data$CATEG[grep("TSUNAMI", data$EVTYPE)] <- "Tsunami"
data$CATEG[grep("DROUGHT", data$EVTYPE)] <- "Drought"
#data$CATEG[grep("LANDSLIDE", data$EVTYPE)] <- "Landslide"
  1. Finally, create a tidy data set that contains only the information needed for analysis.
data$YEAR <- as.numeric(format(as.Date(data$BGN_DATE, '%m/%d/%Y'), '%Y'))
subtotal <- aggregate(cbind(FATALITIES,INJURIES,PROPDAMAGE,CROPDAMAGE) ~ YEAR + CATEG, data=data, sum)
tidy <- gather(subtotal,ESTIMATE,VALUE, -c(YEAR,CATEG))

Results

The project assignment mentions the lack of good records for older events and that more recent years should be considered more complete. In order to explore the completeness of the data set, I decided to create a time series plot for all the categories of events across the time and the four main type of recorded values: fatalities, injuries, property damage and crop damage.

theme_set(theme_bw())
qplot(x=YEAR, y=VALUE, data=tidy, geom="line", color=CATEG, xlab="Year", ylab="Estimated Total") + facet_grid(ESTIMATE ~., scales='free') + scale_fill_manual(values=c("#a6cee3", "#1f78b4", "#b2df8a", "#33a02c", "#fb9a99", "#e31a1c", "#fdbf6f", "#ff7f00", "#cab2d6", "#6a3d9a" )) + theme(legend.key = element_blank())

Figure 1. Time series plot of the total values for the ten chosen event categories.

The graph shows that indeed most of the event categories are missing before 1993, which would make the analysis biased if the SUM of the values in all categories would be computed over the entire time period (1950-2911). Potential alternatives for correcting this problem include imputing the missing data, or using other statistical measures instead of sum (median, etc). However, for simplicity, I decided to include in the further analysis only the more recent values after 1993, and create in Figure 2 a barplot of the total values for this period.

recent <- subset(tidy,YEAR > 1992)
total <- aggregate(VALUE ~ CATEG + ESTIMATE, data=recent, sum)
qplot(x=CATEG, y=VALUE, data=total, geom="bar", stat="identity", xlab="Event Category", ylab="Estimated Total Value") + facet_grid(ESTIMATE ~., scales='free')

Figure 2. Bar plot of the total values for the ten event categories and the four types of recorded values

The results of the analysis depicted in the figure above indicate that, after 1993:

  1. The top causes of death were the heat waves, followed by tornadoes, floods and harsh winter conditions.
  2. The highest number of injuries were caused by tornadoes, followed by heat, floods, and winter conditions.
  3. The higest level of property damage was produced by major thunderstorms, followed in a smaller measure by floods and tornadoes.
  4. The most of the damages to agricultural crops were produced by droughts, floods, harsh winters, and storms. However, the level of these damages is about one order of magnitude smaller than the ones for property damage.

Conclusions

The NOAA data set contains a large number of weather events, collected over a more than 60 years period. Because of that, this raw data set contains multiple types of problems and errors which makes its direct analysis difficult. Choosing the appropriate way to correct most of these problems is very subjective, and multiple approaches can be equaly relevant for performing that, although the manual curation by NOAA experts would probably be the best long term solution to all these problems. However, even with these known problems, knowledge can be extracted from this data set as long as we focus on the most relevant information present in this curated data collection.