1 Synopsis

This project analyses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and identifies the events which cause the greatest human and economic impact from 1950 to end November 2011. The analysis uses pareto charts to show the “vital few” events which cause the most impact. This approach allows the persons responsible for preparing for severe weather events to plan and prioritize their resources in the future.

2 Data Processing

2.1 Loading the data

The data is downloaded from the links provided in the Coursera Assignment rubric.

The National Weather Service Storm Data Documentation and FAQ should be referenced to understand the data.

# Data file location
dataFile <- "./data/StormData.csv.bz2"

options(scipen=9)

# Load libraries
library(reshape2)

# Load data (561.6MB takes some time...)
dataSet <-read.csv(dataFile, stringsAsFactors = FALSE,sep=",")

After loading the data, the following actions were taken to prepare the data for processing.

Further, because there are some storm events which are considered as valid events (See Appendix: Valid Events), it is necessary to remap or discard events which are miscategorised, misspelled, or invalid. I created a function mapEvents() to perform the necessary string replacement operations (See Appendix: Mapping Events).

To calculate the economic cost of the storm events, some calculation needs to be done by parsing the PROPDMG and PROPDMGEXP columns for property damages and CROPDMG and CROPDMGEXP columns for crop damages. I created a function called calcExps() to perform the necessary calculations (See Appendix: Calculating Value of Damages).

mapEvents() is performed on the master dataset, while calcExps() is performed on the subset of data related to economic damage only. The function is not exhaustive, due to the need to prioritize time and effort, so some data is discarded; however, a calculation is performed and described to show that the discarded data does not significantly impact the analysis.

# convert events to uppercase
dataSet$EVTYPE <- toupper(dataSet$EVTYPE)

# trim leading and trailing whitespace
dataSet$EVTYPE <- gsub("^\\s+|\\s+$", "", dataSet$EVTYPE)
dataSet$EVTYPE <- gsub("\\s+", " ", dataSet$EVTYPE)
dataSet$EVTYPE <- as.factor(dataSet$EVTYPE)

dataSet$EVTYPE <- mapEvents(dataSet$EVTYPE)

2.2 Part 1: Population Health Data

Here we subset the data only for events which have resulted in fatalisties or injuries.

dataSet2 <- dataSet[which(dataSet$FATALITIES>0 | dataSet$INJURIES>0),c("EVTYPE","FATALITIES","INJURIES")]

# extract only rows for valid events
dataSet2.valid <- dataSet2[which(dataSet2$EVTYPE %in% validEvents),]

# melt and cast the data
dataSet2.melt <- melt(dataSet2.valid, id=c("EVTYPE"), measure.vars=c("FATALITIES","INJURIES"))
dataSet2.tidy <- dcast(dataSet2.melt, EVTYPE ~ variable,sum)

# subset injuries data
dataSet2.injuries <- dataSet2.tidy[order(dataSet2.tidy$INJURIES,decreasing=TRUE),c("EVTYPE","INJURIES")]

dataSet2.injuries$cumsum <- cumsum(dataSet2.injuries$INJURIES)
dataSet2.injuries$pct <- (dataSet2.injuries$INJURIES/sum(dataSet2.injuries$INJURIES))*100
dataSet2.injuries$cumpct <- cumsum(dataSet2.injuries$pct)


# subset fatalities data
dataSet2.fatalities <- dataSet2.tidy[order(dataSet2.tidy$FATALITIES,decreasing=TRUE),c("EVTYPE","FATALITIES")]

# add calculations for pareto chart
dataSet2.fatalities$cumsum <- cumsum(dataSet2.fatalities$FATALITIES)
dataSet2.fatalities$pct <- (dataSet2.fatalities$FATALITIES/sum(dataSet2.fatalities$FATALITIES))*100
dataSet2.fatalities$cumpct <- cumsum(dataSet2.fatalities$pct)

2.2.1 Excluded Data

In this dataset, some event types are discarded because they do not match the list of valid event categories, due to miscategorisation, spelling errors, or other issues.

unique(dataSet2[which(!dataSet2$EVTYPE %in% validEvents),c("EVTYPE")])
##  [1] "MARINE MISHAP"                  "HIGH"                          
##  [3] "DRY MIRCOBURST WINDS"           "DRY MICROBURST"                
##  [5] "WIND"                           "SNOW"                          
##  [7] "THUNDERSNOW"                    "WINDS"                         
##  [9] "SNOW AND ICE"                   "WIND STORM"                    
## [11] "URBAN AND SMALL STREAM FLOODIN" "BLOWING SNOW"                  
## [13] "RAIN/WIND"                      "HYPOTHERMIA"                   
## [15] "SNOW/ BITTER COLD"              "RAPIDLY RISING WATER"          
## [17] "SNOW/HIGH WINDS"                "MINOR FLOODING"                
## [19] "URBAN/SML STREAM FLD"           "MARINE ACCIDENT"               
## [21] "COASTAL STORM"                  "MIXED PRECIP"                  
## [23] "RAIN/SNOW"                      "SNOW SQUALL"                   
## [25] "HYPOTHERMIA/EXPOSURE"           "SNOW SQUALLS"                  
## [27] "HYPERTHERMIA/EXPOSURE"          "WINTRY MIX"                    
## [29] "NON-SEVERE WIND DAMAGE"         "WARM WEATHER"                  
## [31] "LIGHT SNOW"                     "ROGUE WAVE"                    
## [33] "FALLING SNOW/ICE"               "NON TSTM WIND"                 
## [35] "OTHER"                          "DROWNING"

A total of 166 rows are excluded from the analysis. This data represents 0.7569884627662 % the total number of rows from the dataset, which accounts for 0.706503796632552 % of total fatalities and 0.318086075372879 % of total injuries. Hence the excluded data is considered negligible for the purpose of our analysis.

2.3 Part 2: Economic Impact Data

dataSet3 <- dataSet[which(dataSet$PROPDMG>0 | dataSet$CROPDMG>0),c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]

dataSet3$PROPDMGVALUE <- calcExps(dataSet3$PROPDMG,dataSet3$PROPDMGEXP)
dataSet3$CROPDMGVALUE <- calcExps(dataSet3$CROPDMG,dataSet3$CROPDMGEXP)

# filter out only valid events
dataSet3.valid <- dataSet3[which(dataSet3$EVTYPE %in% validEvents),]

# melt the data
dataSet3.melt <- melt(dataSet3.valid, id=c("EVTYPE"), measure.vars=c("PROPDMGVALUE","CROPDMGVALUE"))
dataSet3.tidy <- dcast(dataSet3.melt, EVTYPE ~ variable,sum)

The data is processed further by adding calculations which will be used to construct a pareto chart in the following section.

# add calculations to make a pareto chart
dataSet3.tidy$TOTALDMG <- dataSet3.tidy$PROPDMGVALUE+dataSet3.tidy$CROPDMGVALUE
dataSet3.tidy <- dataSet3.tidy[order(dataSet3.tidy[,c("TOTALDMG")],decreasing = TRUE),]

dataSet3.tidy$cumsum <- cumsum(dataSet3.tidy$TOTALDMG)
dataSet3.tidy$pct <- (dataSet3.tidy$TOTALDMG/sum(dataSet3.tidy$TOTALDMG))*100
dataSet3.tidy$cumpct <- cumsum(dataSet3.tidy$pct)

2.3.1 Excluded Data

In this dataset, some event types are discarded because they do not match the list of valid event categories, due to miscategorisation, spelling errors, or other issues.

unique(dataSet3[which(!dataSet3$EVTYPE %in% validEvents),c("EVTYPE")])
##  [1] "WIND"                     "BREAKUP FLOODING"        
##  [3] "HIGH TIDES"               "SEVERE TURBULENCE"       
##  [5] "APACHE COUNTY"            "RAINSTORM"               
##  [7] "WINDS"                    "URBAN FLOOD"             
##  [9] "MICROBURST WINDS"         "URBAN/SMALL STREAM FLOOD"
## [11] "URBAN FLOODING"           "MINOR FLOODING"          
## [13] "DAMAGING FREEZE"          "SNOW"                    
## [15] "THUNDERSNOW"              "COOL AND WET"            
## [17] "RURAL FLOOD"              "EXCESSIVE WETNESS"       
## [19] "SLEET/ICE STORM"          "GUSTNADO"                
## [21] "SNOW AND HEAVY SNOW"      "GROUND BLIZZARD"         
## [23] "EXTREME WIND CHILL"       "SNOW/HEAVY SNOW"         
## [25] "WIND DAMAGE"              "SMALL STREAM FLOOD"      
## [27] "SNOW AND ICE"             "WIND STORM"              
## [29] "LAKE FLOOD"               "SNOW AND ICE STORM"      
## [31] "RIVER AND STREAM FLOOD"   "RAIN"                    
## [33] "HEAVY LAKE SNOW"          "DUST DEVIL WATERSPOUT"   
## [35] "MICROBURST"               "BLIZZARD/WINTER STORM"   
## [37] "DUST STORM/HIGH WINDS"    "FOREST FIRES"            
## [39] "HVY RAIN"                 "HARD FREEZE"             
## [41] "URBAN AND SMALL"          "SNOW/COLD"               
## [43] "SNOW SQUALL"              "SNOW/ICE STORM"          
## [45] "HEAVY MIX"                "SNOW FREEZING RAIN"      
## [47] "SNOW/SLEET"               "SNOW/FREEZING RAIN"      
## [49] "SNOW SQUALLS"             "SNOW/SLEET/FREEZING RAIN"
## [51] "COASTAL SURGE"            "SNOW/ICE"                
## [53] "SNOW/ BITTER COLD"        "SNOW/HIGH WINDS"         
## [55] "SNOWMELT FLOODING"        "SNOW ACCUMULATION"       
## [57] "SNOW/ ICE"                "SNOW/BLOWING SNOW"       
## [59] "DRY MICROBURST"           "?"                       
## [61] "URBAN/SMALL STREAM"       "URBAN SMALL"             
## [63] "URBAN FLOODS"             "OTHER"                   
## [65] "URBAN/SML STREAM FLD"     "MARINE ACCIDENT"         
## [67] "EROSION/CSTL FLOOD"       "BEACH EROSION"           
## [69] "UNSEASONABLE COLD"        "EARLY FROST"             
## [71] "WINTRY MIX"               "LANDSLUMP"               
## [73] "COASTAL STORM"            "LIGHT SNOW"              
## [75] "DOWNBURST"                "LIGHT SNOWFALL"          
## [77] "AGRICULTURAL FREEZE"      "LAKE EFFECT SNOW"        
## [79] "MIXED PRECIPITATION"      "DAM BREAK"               
## [81] "BLOWING SNOW"             "GRADIENT WIND"           
## [83] "WET MICROBURST"           "UNSEASONAL RAIN"         
## [85] "COASTAL EROSION"          "LANDSPOUT"               
## [87] "WIND AND WAVE"            "LIGHT FREEZING RAIN"     
## [89] "NON-SEVERE WIND DAMAGE"   "LATE SEASON SNOW"        
## [91] "NON-TSTM WIND"            "BLOWING DUST"            
## [93] "ASTRONOMICAL HIGH TIDE"

A total of 1494 rows are excluded from the analysis. This data represents 0.609718770278046 % the total number of rows from the dataset, which accounts for 0.0429949337452127 % of total property damage value and 1.13168963460089 % of total crop damage value. Hence the excluded data is considered negligible for the purpose of our analysis.

3 Results

3.1 Events most harmful with respect to population health

3.1.1 Storm Events with Most Injuries

Storm events caused approximately 140081 of injuries between 1950 and November 2011.

dataSet2.injuries.top10<- head(dataSet2.injuries,10)

plot.new()
par(mar=c(10, 6, 2, 6))

bp <- barplot(
  dataSet2.injuries.top10$INJURIES/10^3,
  names.arg = dataSet2.injuries.top10$EVTYPE, cex.names = 0.6, las=2,
  ylab = "Injuries ('000)", ylim=c(0,100),
  xlim = c(0, length(dataSet2.injuries.top10$EVTYPE) + 2),
  main="Top 10 Storm Events by Injuries Caused", col="steelblue"
)
title(xlab = "Event Type",line=7)

# the line plot showing cumulative percentage
par(new=T)
plot(
  type="b",cex=0.2,lwd=4, col="black",
  x=bp,  y=dataSet2.injuries.top10$cumpct,
  xlim = c(0, length(dataSet2.injuries.top10$EVTYPE) + 2), ylim=c(0,100),
  xaxt="n", yaxt = "n", yaxs = "i", axes=F, ylab="",xlab=""
)
axis(side = 4, ylim=c(0,100),las=1)
mtext(text="% of total Injuries", side=4, line=3,cex=0.8)

# add a line to show the 80% mark
abline(h=80,col="red",lwd=1)

TORNADO caused the most number of injuries (91364 (65.2222643)), while a range of other events including THUNDERSTORM WIND ( 6.79 % ), EXCESSIVE HEAT ( 5.09 % ) caused 80% of all injuries recorded.

3.1.2 Storm Events with Most Fatalities

Storm events caused approximately 15038 fatalities between 1950 and November 2011.

dataSet2.fatalities.top10<- head(dataSet2.fatalities,10)

plot.new()
par(mar=c(10, 6, 2, 6))

# the barplot showing number of fatalities
bp <- barplot(
  dataSet2.fatalities.top10$FATALITIES/10^3,
  names.arg = dataSet2.fatalities.top10$EVTYPE, cex.names = 0.6, las=2,
  ylab = "Fatalities ('000)", xlim = c(0, length(dataSet2.fatalities.top10$EVTYPE) + 2),
  main="Top 10 Storm Events by Fatalities Caused", col="steelblue"
)
title(xlab = "Event Type",line=7)

# the line plot showing cumulative percentage
par(new=T)
plot(
  type="b",cex=0.2,lwd=4, col="black",
  x=bp, y=dataSet2.fatalities.top10$cumpct,
  xlim = c(0, length(dataSet2.fatalities.top10$EVTYPE) + 2), ylim=c(0,100),
  xaxt="n", yaxt = "n", yaxs = "i", axes=F, ylab="",xlab=""
)
axis(side = 4, ylim=c(0,100),las=1)
mtext(text="% of total Fatalities", side=4, line=3, cex=0.8)

# add a line to show the 80% mark
abline(h=80,col="red",lwd=1,lty=5)

TORNADO caused the most number of fatalities (5658 (37.6246841)), while a range of other events including EXCESSIVE HEAT ( 14.62 % ), FLASH FLOOD ( 6.77 % ), HEAT ( 6.5 % ), LIGHTNING ( 5.43 % ), THUNDERSTORM WIND ( 4.73 % ), RIP CURRENT ( 3.84 % ) caused 80% of all fatalities recorded.

3.2 Events with greatest economic impact

Storm events caused approximately $ 475.68 Billion of economic losses between 1950 and November 2011. Of this total, $ 427.13 Billion are property losses and $ 48.55 Billion are crop losses.

Using the data we can plot a pareto chart which shows the top 10 events which caused the highest economic losses.

# get the top 10 events in terms of economic damage caused
dataSet3.tidy.top10 <- head(dataSet3.tidy,10)

# Plot the pareto chart
plot.new()
par(mar=c(10, 6, 3, 7),cex.lab=0.9,cex.axis=0.8,cex.main=1.2)

bp <- barplot(
  t(dataSet3.tidy.top10[,c("PROPDMGVALUE","CROPDMGVALUE")])/10^9,
  names.arg = dataSet3.tidy.top10$EVTYPE,
  ylab = "Damages ($Billion)", ylim = c(0,200),
  xlim = c(0, length(dataSet3.tidy.top10$EVTYPE) + 2),
  main="Top 10 Storm Events by Economic Impact",
  col=c("dodgerblue","darkseagreen"),border=NA,
  cex.names = 0.6,las=2, 
)
title(xlab = "Event Type",line=8)

par(new=T)
plot(
  x=bp, y=dataSet3.tidy.top10$cumpct,
  type="b",cex=0.2,lwd=4, col="salmon", 
  xlim = c(0, length(dataSet3.tidy.top10$EVTYPE) + 2), ylim=c(0,100),
, xaxt="n", yaxt = "n", yaxs = "i",axes=F, ylab="",xlab=""
)
axis(side = 4, ylim=c(0,100),las=1)
mtext(text="% of total damages", side=4, line=2.5,cex=0.9)

# add a line to show the 80% mark
abline(h=80,col="grey45",lwd=1,lty=5)

Top 10 Storm Events by Economic Impact

From the pareto chart, we can see which “vital few” storm events account for 80% of the total damage, and where mitigation efforts should be focused for most effectiveness. FLOOD have cost the most economic damage, accounting for 33.9 % of the total, followed by HURRICANE (TYPHOON) ( 19.1 % ), TORNADO ( 12.39 % ), STORM SURGE/TIDE ( 10.08 % ), HAIL ( 4 % ).


4 Appendix

4.1 Valid Events

validEvents <- toupper(c("Astronomical Low Tide","Avalanche","Blizzard",
                 "Coastal Flood","Cold/Wind Chill","Debris Flow",
                 "Dense Fog","Dense Smoke","Drought","Dust Devil",
                 "Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
                 "Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog",
                 "Hail","Heat","Heavy Rain","Heavy Snow","High Surf",
                 "High Wind","Hurricane (Typhoon)","Ice Storm",
                 "Lake-Effect Snow","Lakeshore Flood","Lightning",
                 "Marine Hail","Marine High Wind","Marine Strong Wind",
                 "Marine Thunderstorm Wind","Rip Current","Seiche",
                 "Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind",
                 "Tornado","Tropical Depression","Tropical Storm","Tsunami",
                 "Volcanic Ash","Waterspout","Wildfire","Winter Storm",
                 "Winter Weather"
                 ))

4.2 Mapping Events

# mapEvents()
# Maps various erroneous event type data (miscategorisation, spelling errors, etc.)
# from Storm Data to the valid event types

mapEvents <- function(mapto) {
  mapto <- gsub("^TSTM WIND","THUNDERSTORM WIND",mapto)
  mapto <- gsub("^TSTMW","THUNDERSTORM WIND",mapto)

  mapto <- gsub("^THUNDERSTORM.*","THUNDERSTORM WIND",mapto)
  mapto <- gsub("^(THUNDERTORM|THUNDEERSTORM|THUNDERESTORM|THUNDERSTORM|THUNERSTORM|THUDERSTORM|THUNDERSTROM|TUNDERSTORM) (WIND.*|WINS)$","THUNDERSTORM WIND",mapto)
  mapto <- gsub("^SEVERE THUNDERSTORM.*","THUNDERSTORM WIND",mapto)
mapto <- gsub("^STORM FORCE.*","THUNDERSTORM WIND",mapto)


  mapto <- gsub("^WHIRLWIND$","THUNDERSTORM WIND",mapto)

  mapto <- gsub("^MARINE.*WIND.*","MARINE THUNDERSTORM WIND",mapto)
  mapto <- gsub("^(HURRICANE|TYPHOON).*","HURRICANE (TYPHOON)",mapto)
  mapto <- gsub("^TORNADO.*","TORNADO",mapto)
  mapto <- gsub("^TORNDAO","TORNADO",mapto)

mapto <- gsub("^WATERSPOUT.*","WATERSPOUT",mapto)
  mapto <- gsub("^LIGHTNING.*","LIGHTNING",mapto)
mapto <- gsub("^(LIGHTING|LIGNTNING)","LIGHTNING",mapto)
  mapto <- gsub("^AVALANCE.*","AVALANCHE",mapto)
  mapto <- gsub("^WILD.*|BRUSH FIRE|GRASS FIRE.*","WILDFIRE",mapto)
  mapto <- gsub("^HIGH WIND.*","HIGH WIND",mapto)
  mapto <- gsub("^RIP .*","RIP CURRENT",mapto)
  mapto <- gsub("^FLASH FLOOD.*","FLASH FLOOD",mapto)


  mapto <- gsub("^FLOOD.*|HIGH WATER|RIVER FLOODING|RIVER FLOOD|MAJOR FLOOD","FLOOD",mapto)
  mapto <- gsub("^UNSEASONABLY WARM.*","HEAT",mapto)

  mapto <- gsub("^(EXTREME|RECORD|EXCESSIVE) HEAT|HEAT WAVE.*","EXCESSIVE HEAT",mapto)
  mapto <- gsub("^RECORD/EXCESSIVE HEAT","EXCESSIVE HEAT",mapto)

  mapto <- gsub("^(EXTREME|EXCESSIVE|RECORD) (WINDCHILL|COLD).*","EXTREME COLD/WIND CHILL",mapto)
  mapto <- gsub("^(EXTENDED|UNSEASONABLY) COLD.*","COLD/WINDCHILL",mapto)
  mapto <- gsub("^LOW TEMPERATURE","COLD/WINDCHILL",mapto)

  mapto <- gsub("^COLD.*","COLD/WIND CHILL",mapto)
  mapto <- gsub("^TROPICAL STORM.*","TROPICAL STORM",mapto)
  mapto <- gsub("^COASTAL FLOOD.*|TIDAL FLOODING","COASTAL FLOOD",mapto)
  mapto <- gsub("^COASTALSTORM.*","COASTAL STORM",mapto)

  mapto <- gsub("^DROUGHT.*","DROUGHT",mapto)
  mapto <- gsub("^(HEAVY|HIGH|ROUGH|HAZARDOUS) (SEAS|WAVES|SWELLS|SURF).*","HIGH SURF",mapto)

  mapto <- gsub("^WINTER WEATHER.*","WINTER WEATHER",mapto)
  mapto <- gsub("^WINTER STORM.*","WINTER STORM",mapto)
  mapto <- gsub("^ICE STORM.*","-ICE STORM",mapto)
  mapto <- gsub("^GLAZE.*","-FREEZING FOG",mapto)
  mapto <- gsub("^(FROST|FREEZE|ICE|ICY|FREEZING|BLACK ICE).*","FROST/FREEZE",mapto)
  mapto <- gsub("^-ICE STORM","ICE STORM",mapto)
  mapto <- gsub("^-FREEZING FOG","FREEZING FOG",mapto)

  mapto <- gsub("^(HEAVY|EXCESSIVE|TORRENTIAL|RECORD) (RAIN|PRECIPITATION|SHOWER).*","HEAVY RAIN",mapto)
  mapto <- gsub("^(HEAVY|EXCESSIVE|RECORD) SNOW.*","HEAVY SNOW",mapto)
  mapto <- gsub("^(STRONG|GUSTY) WIND.*","STRONG WIND",mapto)

  mapto <- gsub("^(LANDSLIDE|MUDSLIDE|MUD SLIDE|ROCK SLIDE).*","DEBRIS FLOW",mapto)
  mapto <- gsub("^FOG.*","DENSE FOG",mapto)
  mapto <- gsub(".*HAIL.*","HAIL",mapto)
  mapto <- gsub("STORM SURGE.*","STORM SURGE/TIDE",mapto)

  # return the mapped vector
  mapto
}

4.3 Calculating Value of Damages

# calcExps()
# returns a vector of calculated value & exponent pairs
# the substitutions for values of exponents column is based
# on this report:
# https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html

calcExps <- function(values,exps) {
  result <- numeric(length(values))

  for (i in seq_along(values)) {

    exp <- switch(toupper(exps[i]),
                   K=10^3,
                   M=10^6,
                   H=10^2,
                   B=10^9,
                   "+"=1,
                  "1"=10,
                  "2"=10,
                  "3"=10,
                  "4"=10,
                  "5"=10,
                  "6"=10,
                  "7"=10,
                  "8"=10,
                   0 # default
                   )
    result[i] <- values[i]*exp
  }
  result
}