Abstract:

This is a RR Course Project report which identifies severe weather events based on processing and analysis of U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database. The preprocessing of the data attempts to clean up missing, redundant, and inconsistent values of event types in the source data to prevent data contamination. This report has tentative conclusions to help National or State agencies in optimal allocation of limited resources available by focussing attention to a small but dominant subset of a large collection of severe weather events, which contribute disproportionately to elevated loss of human life, injuries, and damage to property and crops. The report also attempts to explore the impact of these events in terms of geographic and temporal dominance to bring further focus on targeted allocation of resources.

Synopsis:

The following is a list of tentative conclusions and conjectures based on the analysis:

  1. A relatively small subset of severe weather events (15 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of loss of human life and injuries. These N=15 events appear to score a 90th percentile rank in relation to impact from all other events.

  2. This subset of N=15 severe weather events (designated as Event Group A Top N in the report) appear to have a significantly higher contribution (yearly mean fatality, and yearly mean injury) in comparison to all other events when the events are summarized per year) spanning all the years from 1950 to 2012.

  3. A subset of N=25 states/regions have been scored at 90th percentile rank in terms of the overall national casualties.

  4. A different, independent subset of severe weather events (10 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of property damage measured in U.S dollars. These N=10 events are scored at 90th percentile rank in relation to property damage impact from all other events.

  5. A different, independent subset of severe weather events (13 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of crop damage measured in U.S dollars. These N=13 events are scored at 90th percentile rank in relation to crop damage impact from all other events.

The subsets identified above are enumerated in the results section of the report.

Data Processing:

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dplyr)

require(lattice)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
library(lattice)

require(lubridate)
## Loading required package: lubridate
## Warning: package 'lubridate' was built under R version 3.1.3
library(lubridate)

require(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 3.1.3
library(knitr)

library(dplyr)
library(lattice)

require(stats)
library

require(stringr)
## Loading required package: stringr
library(stringr)

require(reshape2)
## Loading required package: reshape2
library(reshape2)
opts_chunk$set(echo=TRUE, cache=TRUE)

We read in the NOAA Storm datase from the raw csv file included in the zip archive. The data is a comma delimited file, and missing values are inherently coded as NA fields in the input data file. The header data is included in the reading.

repdata.data.StormData <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))

dim(repdata.data.StormData)
## [1] 902297     37
# summary(repdata.data.StormData)

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The NOAA Storm dataset has a number of redundant labels for Event types. In this section, we try to clean up the data by correct labelling using cleanup code. In this section, we also attempt to reduce the dimensionality of data in the Storm dataset by looking at only the consequential events (fatalities > 0 or injuries > 0) for further analysis.

# Check for any non-numeric values in FATALITIES and INJURIES columns

sum(any(!is.numeric(repdata.data.StormData$FATALITIES)))
## [1] 0
sum(any(!is.numeric(repdata.data.StormData$INJURIES)))
## [1] 0
# Across the United States, which types of events (as indicated in the EVTYPE variable) 
# are most harmful with respect to population health?

mostHarmfulEventDS <- filter(repdata.data.StormData,FATALITIES > 0 | INJURIES > 0)

# We realize this consequential filter reduces the dimensionality of the dataset
# significantly from 902297 rows to 21929 rows, which is approximately 98% reduction
dim(mostHarmfulEventDS)
## [1] 21929    37
# Introduce a new variable to sum the fatalities and injuries for each observation
CASUALTIES <- mostHarmfulEventDS$FATALITIES + mostHarmfulEventDS$INJURIES

casualtyDS <- select(mostHarmfulEventDS, EVTYPE, STATE, BGN_DATE, FATALITIES, INJURIES)

casualtyDS <- mutate(casualtyDS, CASUALTIES)

# Look for data anomalies and inconsistencies in EVTYPE variable in the new dataset

events <- select(casualtyDS,EVTYPE)

uniqEvents <- unique(events)

dim(uniqEvents)
## [1] 220   1
# Notice the events boil down to 220. A quick visual inspection of the unique events
# shows the EVTTYPE variable has a few noticeable classification errors. FOr example,
# the observation with EVTYPE = "HIGH" is ambiguous, and needs removal

#sort(uniqEvents$EVTYPE)

filter(casualtyDS,EVTYPE == "HIGH")
##   EVTYPE STATE           BGN_DATE FATALITIES INJURIES CASUALTIES
## 1   HIGH    CA 11/16/1994 0:00:00          0        1          1
casualtyDS <- filter(casualtyDS, EVTYPE != "HIGH")

casualtyDS$EVTYPE <- gsub("AVALANCE","AVALANCHE",casualtyDS$EVTYPE)

#casualtyDS2 <- casualtyDS

casualtyDS$EVTYPE <- unlist(lapply(casualtyDS$EVTYPE,function(x){toupper(x)}))

dim(casualtyDS)
## [1] 21928     6
casualtyDS$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("HEAT$","HEAT WAVE",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("WINDS$","WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",casualtyDS$EVTYPE)

casualtyDS$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",casualtyDS$EVTYPE)

Summarization of the dataset by EVTYPE and ranking of the data to get the dominant subset of events based on ranking the events using a score of 90th percentile ranking.

# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
casualtySummary <- summarize(group_by(casualtyDS, EVTYPE), SUM_CASUALTIES=sum(CASUALTIES), SUM_FATALITIES=sum(FATALITIES), SUM_INJURIES=sum(INJURIES))

# To extract the severest of the casualties, apply filter: fatalities > 0 and injuries > 0
# to the summary observations 
# 
severeCasualtySummary <- filter(casualtySummary, SUM_FATALITIES > 0 & SUM_INJURIES > 0)
severeCasualtySummary <- select(severeCasualtySummary, EVTYPE, FATALITIES=SUM_FATALITIES, INJURIES=SUM_INJURIES, CASUALTIES=SUM_CASUALTIES)

# Rank casuality summary by highest to lowest fatalities
severeCasualtySummary <- arrange(severeCasualtySummary, desc(FATALITIES))

PERCENT_FATALITIES <- (100.0 * severeCasualtySummary$FATALITIES/sum(severeCasualtySummary$FATALITIES,na.rm=TRUE))

#View(PERCENT_FATALITIES)

severeCasualtySummary <- mutate(severeCasualtySummary,PERCENT_FATALITIES)

CUM_PERCENT_FATALITIES <- 1:length(PERCENT_FATALITIES)
for(i in 1:length(PERCENT_FATALITIES)) {
  sum <- 0
  for(j in 1:i) {
    sum <- sum + PERCENT_FATALITIES[j]
  }
  CUM_PERCENT_FATALITIES[i] = sum
}

severeCasualtySummary = mutate(severeCasualtySummary,CUM_PERCENT_FATALITIES)

names(severeCasualtySummary)
## [1] "EVTYPE"                 "FATALITIES"            
## [3] "INJURIES"               "CASUALTIES"            
## [5] "PERCENT_FATALITIES"     "CUM_PERCENT_FATALITIES"

For data visualization and easy grasp of the relative magnitude of each event, take the top N events which contribute to 90th percentile ranking of overall fatalities over all years across United States. Use the 90th percentile rank to arrive at the dominant severe weather events.

topNEvents <- filter(severeCasualtySummary, CUM_PERCENT_FATALITIES < 90.1)
#topNEvents <- severeCasualtySummary[1:20,]

# Enumerate the top N events scoring 90th percentile score in descending order
topNEvents$EVTYPE
##  [1] "TORNADO"                 "EXCESSIVE HEAT WAVE"    
##  [3] "HEAT WAVE"               "FLASH FLOOD"            
##  [5] "LIGHTNING"               "THUNDERSTORM WIND"      
##  [7] "RIP CURRENT"             "FLOOD"                  
##  [9] "HIGH WIND"               "AVALANCHE"              
## [11] "WINTER STORM"            "EXTREME COLD"           
## [13] "HEAVY SNOW"              "EXTREME COLD/WIND CHILL"
## [15] "STRONG WINDS"
topNSummaryMelt <- melt(topNEvents,id.vars=c("EVTYPE"),meas.vars=c("FATALITIES","INJURIES","CASUALTIES","PERCENT_FATALITIES"))

names(topNSummaryMelt) <- c("EVTYPE","CATEGORY","VALUE")

Next, we explore how the temporal impact of these top N events ranked by top fatalities is distributed over time across all states in U.S.

casualtyDS <- mutate(casualtyDS,YEAR=year(parse_date_time(BGN_DATE,"%m%d%Y %H%M%S")))

names(casualtyDS)
## [1] "EVTYPE"     "STATE"      "BGN_DATE"   "FATALITIES" "INJURIES"  
## [6] "CASUALTIES" "YEAR"
#View(head(casualtyDS,1000))

casualtySummary2 <- summarize(group_by(casualtyDS, EVTYPE, YEAR), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES))

dim(casualtySummary2)
## [1] 808   4
#topNEvents$EVTYPE

topNEventsSummary2 <- filter(casualtySummary2,EVTYPE %in% topNEvents$EVTYPE)
#View(topNEventsSummary2)

topNEventsSummaryByYear <- summarize(group_by(topNEventsSummary2, YEAR), CUM_FATALITIES=sum(FATALITIES,na.rm=TRUE), CUM_INJURIES=sum(INJURIES, na.rm=TRUE))

meanAggregateFatalities <- mean(topNEventsSummaryByYear$CUM_FATALITIES,na.rm=TRUE)

MEAN_FATALITIES <- rep(meanAggregateFatalities, nrow(topNEventsSummaryByYear))

length(MEAN_FATALITIES)
## [1] 62
meanAggregateInjuries <- mean(topNEventsSummaryByYear$CUM_INJURIES,na.rm=TRUE)

MEAN_INJURIES <- rep(meanAggregateInjuries, nrow(topNEventsSummaryByYear))

topNEventsSummaryByYear <- mutate(topNEventsSummaryByYear, MEAN_FATALITIES, MEAN_INJURIES)

otherEventsSummary <- filter(casualtySummary2,!EVTYPE %in% topNEvents$EVTYPE)

otherEventsSummaryByYear <- summarize(group_by(otherEventsSummary, YEAR), CUM_FATALITIES=sum(FATALITIES,na.rm=TRUE), CUM_INJURIES=sum(INJURIES, na.rm=TRUE))

meanAggregateFatalities2 <- mean(otherEventsSummaryByYear$CUM_FATALITIES,na.rm=TRUE)

MEAN_FATALITIES2 <- rep(meanAggregateFatalities2, nrow(otherEventsSummaryByYear))

length(MEAN_FATALITIES2)
## [1] 28
meanAggregateInjuries2 <- mean(otherEventsSummaryByYear$CUM_INJURIES,na.rm=TRUE)

MEAN_INJURIES2 <- rep(meanAggregateInjuries2, nrow(otherEventsSummaryByYear))

length(MEAN_INJURIES2)
## [1] 28
# Mean of aggregate fatalities per year of top N severe events
meanAggregateFatalities
## [1] 216.5484
# Mean of aggregate fatalities per year of all other severe events (excluding top N events)
meanAggregateFatalities2
## [1] 61.39286
# Mean of aggregate injuries per year of top N severe events
meanAggregateInjuries
## [1] 2073.145
# Mean of aggregate injuries per year of all other severe events (excluding top N events)
meanAggregateInjuries2
## [1] 428.2857
otherEventsSummaryByYear <- mutate(otherEventsSummaryByYear, MEAN_FATALITIES2, MEAN_INJURIES2)

differentiator1 <- as.factor(rep("TOP_N", nrow(topNEventsSummaryByYear)))

topNEventsSummaryByYear <- mutate(topNEventsSummaryByYear, differentiator1)

differentiator2 <- as.factor(rep("OTHER_THAN_TOP_N", nrow(otherEventsSummaryByYear)))

otherEventsSummaryByYear <- mutate(otherEventsSummaryByYear, differentiator2)

names(topNEventsSummaryByYear) <- c("YEAR","CUM_FATALITIES","CUM_INJURIES","MEAN_FATALITIES","MEAN_INJURIES","CATEGORY")
names(otherEventsSummaryByYear) <- c("YEAR","CUM_FATALITIES","CUM_INJURIES","MEAN_FATALITIES","MEAN_INJURIES","CATEGORY")

yearlyComparisonDS <- rbind(topNEventsSummaryByYear, otherEventsSummaryByYear)

yearlyComparisonDS$CATEGORY <- gsub("TOP_N$","EVENT_GROUP_A_TOP_N",yearlyComparisonDS$CATEGORY)
yearlyComparisonDS$CATEGORY <- gsub("OTHER_THAN_EVENT_GROUP_A_TOP_N$","EVENT_GROUP_B_OTHERS",yearlyComparisonDS$CATEGORY)

Next we explore the regional impact of the top N severe weather events to identify top M states which take the brunt of the fatility and injury impact. Let us use the 90th percentile score to get the top M states for summarization

casualtySummary3 <- summarize(group_by(casualtyDS, EVTYPE, STATE), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES))

dim(casualtySummary3)
## [1] 1304    4
names(casualtySummary3)
## [1] "EVTYPE"     "STATE"      "FATALITIES" "INJURIES"
topNEventsSummary3 <- filter(casualtySummary3,EVTYPE %in% topNEvents$EVTYPE)

topNEventsSummary3 <- mutate(topNEventsSummary3, CASUALTIES = FATALITIES + INJURIES)

#View(topNEventsSummary3)

topNEventsSummaryByState <- summarize(group_by(topNEventsSummary3, STATE), CUM_CASUALTIES=sum(CASUALTIES))

names(topNEventsSummaryByState)
## [1] "STATE"          "CUM_CASUALTIES"
topNEventsSummaryByState <- arrange(topNEventsSummaryByState, desc(CUM_CASUALTIES))

#### Rank casuality summary by state from highest to lowest total casualties

PERCENT_CASUALTIES <- (100.0 * topNEventsSummaryByState$CUM_CASUALTIES/sum(topNEventsSummaryByState$CUM_CASUALTIES))

#View(PERCENT_CASUALTIES)

topNEventsSummaryByState <- mutate(topNEventsSummaryByState,PERCENT_CASUALTIES)

CUM_PERCENT_CASUALTIES <- 1:length(PERCENT_CASUALTIES)
for(i in 1:length(PERCENT_CASUALTIES)) {
    sum <- 0
    for(j in 1:i) {
        sum <- sum + PERCENT_CASUALTIES[j]
    }
    CUM_PERCENT_CASUALTIES[i] = sum
}

#View(CUM_PERCENT_CASUALTIES)

topNEventsSummaryByState = mutate(topNEventsSummaryByState, CUM_PERCENT_CASUALTIES)

# identify top N states covering 90 percentile of overall casualties (fatalities + injuries)

topNStatesByCasualties <- filter(topNEventsSummaryByState, CUM_PERCENT_CASUALTIES < 90.1)

# topNStatesByCasualties

# Top 20 states make up 82 percentile of total casualties
topNStatesByCasualties[1:20,]
## Source: local data frame [20 x 4]
## 
##    STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1     TX          18163          12.794359               12.79436
## 2     MO           9710           6.839907               19.63427
## 3     AL           9397           6.619424               26.25369
## 4     MS           7084           4.990103               31.24379
## 5     IL           6870           4.839357               36.08315
## 6     OK           6076           4.280049               40.36320
## 7     AR           6010           4.233557               44.59676
## 8     TN           5685           4.004621               48.60138
## 9     OH           5565           3.920091               52.52147
## 10    FL           5377           3.787660               56.30913
## 11    IN           5013           3.531252               59.84038
## 12    GA           4938           3.478420               63.31880
## 13    MI           4909           3.457992               66.77679
## 14    KY           3695           2.602828               69.37962
## 15    NC           3670           2.585217               71.96484
## 16    PA           3621           2.550701               74.51554
## 17    KS           3569           2.514071               77.02961
## 18    LA           3341           2.353463               79.38307
## 19    IA           2861           2.015342               81.39841
## 20    MN           2389           1.682857               83.08127
# Remaining states if the number in the result is more than 20
if(nrow(topNStatesByCasualties) > 20) 
{
    topNStatesByCasualties[21:nrow(topNStatesByCasualties),]
}
## Source: local data frame [5 x 4]
## 
##   STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1    WI           2387           1.681448               84.76272
## 2    MA           2220           1.563810               86.32653
## 3    SC           1901           1.339100               87.66563
## 4    VA           1714           1.207374               88.87300
## 5    NE           1424           1.003092               89.87609

Question 2: Across the United States, which types of events have the greatest economic consequences?

sum(any(is.na(repdata.data.StormData$PROPDMG)))
## [1] 0
sum(any(is.na(repdata.data.StormData$PROPDMGEXP)))
## [1] 0
unique(repdata.data.StormData$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
summarize(group_by(repdata.data.StormData,PROPDMGEXP),PROPEXP_COUNT=n())
## Source: local data frame [19 x 2]
## 
##    PROPDMGEXP PROPEXP_COUNT
## 1                    465934
## 2           -             1
## 3           ?             8
## 4           +             5
## 5           0           216
## 6           1            25
## 7           2            13
## 8           3             4
## 9           4             4
## 10          5            28
## 11          6             4
## 12          7             5
## 13          8             1
## 14          B            40
## 15          h             1
## 16          H             6
## 17          K        424665
## 18          m             7
## 19          M         11330
propertyDamageEvents <- filter(repdata.data.StormData, PROPDMG > 0 & PROPDMGEXP %in% c("K","M","B"))

dim(propertyDamageEvents)
## [1] 238840     37
sum(any(is.na(repdata.data.StormData$CROPDMG)))
## [1] 0
sum(any(is.na(repdata.data.StormData$CROPDMGEXP)))
## [1] 0
sum(any(!is.numeric(repdata.data.StormData$CROPDMG)))
## [1] 0
unique(repdata.data.StormData$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
summarize(group_by(repdata.data.StormData,CROPDMGEXP),CROPEXP_COUNT=n())
## Source: local data frame [9 x 2]
## 
##   CROPDMGEXP CROPEXP_COUNT
## 1                   618413
## 2          ?             7
## 3          0            19
## 4          2             1
## 5          B             9
## 6          k            21
## 7          K        281832
## 8          m             1
## 9          M          1994
cropDamageEvents <- filter(repdata.data.StormData, CROPDMG > 0 & CROPDMGEXP != "" & CROPDMGEXP %in% c("b","B","k","K","m","M"))
cropDamageEvents$CROPDMGEXP <-gsub("k","K",cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <-gsub("m","M",cropDamageEvents$CROPDMGEXP)

summarize(group_by(cropDamageEvents,CROPDMGEXP),CROPEXP_COUNT=n())
## Source: local data frame [3 x 2]
## 
##   CROPDMGEXP CROPEXP_COUNT
## 1          B             7
## 2          K         20158
## 3          M          1919
propertyDamageEvents$PROPDMGEXP <- gsub("K","1000", propertyDamageEvents$PROPDMGEXP)
propertyDamageEvents$PROPDMGEXP <- gsub("M","1000000", propertyDamageEvents$PROPDMGEXP)
propertyDamageEvents$PROPDMGEXP <- gsub("B","1000000000", propertyDamageEvents$PROPDMGEXP)

cropDamageEvents$CROPDMGEXP <- gsub("K","1000", cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <- gsub("M","1000000", cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <- gsub("B","1000000000", cropDamageEvents$CROPDMGEXP)

propertyDamageEvents$PROPDMG <- propertyDamageEvents$PROPDMG * as.numeric(propertyDamageEvents$PROPDMGEXP)
cropDamageEvents$CROPDMG <- cropDamageEvents$CROPDMG * as.numeric(cropDamageEvents$CROPDMGEXP)

#View(propertyDamageEvents)
#View(cropDamageEvents)

propertyDamageEvents <- filter(propertyDamageEvents, EVTYPE != "HIGH")

propertyDamageEvents$EVTYPE <- gsub("AVALANCE","AVALANCHE",propertyDamageEvents$EVTYPE)

#propertyDamageEvents2 <- propertyDamageEvents

propertyDamageEvents$EVTYPE <- unlist(lapply(propertyDamageEvents$EVTYPE,function(x){toupper(x)}))

dim(propertyDamageEvents)
## [1] 238840     37
propertyDamageEvents$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("HEAT$","HEAT WAVE",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("WINDS$","WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",propertyDamageEvents$EVTYPE)

propertyDamageEvents$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",propertyDamageEvents$EVTYPE)

Let us rank the severe weather events for overall property damage based on 90th percentile score.

# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
propertyDamageSummary <- summarize(group_by(propertyDamageEvents, EVTYPE), PROPDMG=sum(PROPDMG))

#View(propertyDamageSummary)

propertyDamageSummary <- arrange(propertyDamageSummary,desc(PROPDMG))

PERCENT_DAMAGE <- (100.0 * propertyDamageSummary$PROPDMG/sum(propertyDamageSummary$PROPDMG,na.rm=TRUE))

#View(PERCENT_DAMAGE)

propertyDamageSummary <- mutate(propertyDamageSummary,PERCENT_DAMAGE)

CUM_PERCENT_DAMAGE <- 1:length(PERCENT_DAMAGE)
for(i in 1:length(PERCENT_DAMAGE)) {
    sum <- 0
    for(j in 1:i) {
        sum <- sum + PERCENT_DAMAGE[j]
    }
    CUM_PERCENT_DAMAGE[i] = sum
}

names(propertyDamageSummary)
## [1] "EVTYPE"         "PROPDMG"        "PERCENT_DAMAGE"
dim(propertyDamageSummary)
## [1] 354   3
propertyDamageSummary <- mutate(propertyDamageSummary, CUM_PERCENT_DAMAGE)

pds <- filter(propertyDamageSummary, CUM_PERCENT_DAMAGE < 90.1)

# Enumerate top N states which cover 90 percentile of the overall property damage
pds$EVTYPE
##  [1] "FLOOD"             "HURRICANE/TYPHOON" "TORNADO"          
##  [4] "STORM SURGE"       "FLASH FLOOD"       "HAIL"             
##  [7] "HURRICANE"         "THUNDERSTORM WIND" "TROPICAL STORM"   
## [10] "WINTER STORM"
pdsReport <- pds

pds <- select(pds, -CUM_PERCENT_DAMAGE)

Similar data cleaning of the NOAA dataset for ranking severe weather events for crop damage.

cropDamageEvents <- filter(cropDamageEvents, EVTYPE != "HIGH")

cropDamageEvents$EVTYPE <- gsub("AVALANCE","AVALANCHE",cropDamageEvents$EVTYPE)

#cropDamageEvents2 <- cropDamageEvents

cropDamageEvents$EVTYPE <- unlist(lapply(cropDamageEvents$EVTYPE,function(x){toupper(x)}))

dim(cropDamageEvents)
## [1] 22084    37
cropDamageEvents$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("HEAT$","HEAT WAVE",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("WINDS$","WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("THUNDERSTORM  WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",cropDamageEvents$EVTYPE)

cropDamageEvents$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",cropDamageEvents$EVTYPE)

Let us rank the severe weather events for overall crop damage based on 90th percentile score.

# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
cropDamageSummary <- summarize(group_by(cropDamageEvents, EVTYPE), CROPDMG=sum(CROPDMG))

cropDamageSummary <- arrange(cropDamageSummary,desc(CROPDMG))

PERCENT_DAMAGE <- (100.0 * cropDamageSummary$CROPDMG/sum(cropDamageSummary$CROPDMG,na.rm=TRUE))

# View(PERCENT_DAMAGE)

cropDamageSummary <- mutate(cropDamageSummary,PERCENT_DAMAGE)

CUM_PERCENT_DAMAGE <- 1:length(PERCENT_DAMAGE)
for(i in 1:length(PERCENT_DAMAGE)) {
    sum <- 0
    for(j in 1:i) {
        sum <- sum + PERCENT_DAMAGE[j]
    }
    CUM_PERCENT_DAMAGE[i] = sum
}

names(cropDamageSummary)
## [1] "EVTYPE"         "CROPDMG"        "PERCENT_DAMAGE"
cropDamageSummary <- mutate(cropDamageSummary, CUM_PERCENT_DAMAGE)

# Enumerate top N states which cover 90 percentile of the overall property damage
cds <- filter(cropDamageSummary, CUM_PERCENT_DAMAGE < 90.1)

cds$EVTYPE
##  [1] "DROUGHT"           "FLOOD"             "RIVER FLOOD"      
##  [4] "ICE STORM"         "HAIL"              "HURRICANE"        
##  [7] "HURRICANE/TYPHOON" "FLASH FLOOD"       "EXTREME COLD"     
## [10] "THUNDERSTORM WIND" "FROST/FREEZE"      "HEAVY RAIN"
cdsReport <- cds

cds <- select(cds, -CUM_PERCENT_DAMAGE)

Combine top M events for property damage and top M events for crop damage by row binding and add a new column for category type which enables to combine the two sets of events into a single xyplot with two panels.

dim(cropDamageSummary)
## [1] 121   4
dim(propertyDamageSummary)
## [1] 354   4
names(pds) <- c("EVTYPE","DAMAGE_COST","PERCENT_DAMAGE")
names(cds) <- c("EVTYPE","DAMAGE_COST","PERCENT_DAMAGE")

CAT_PROP_DAMAGE <- rep("PROPERTY_DAMAGE",nrow(pds))

pds <- mutate(pds, CATEGORY = CAT_PROP_DAMAGE)

CAT_CROP_DAMAGE <- rep("CROP_DAMAGE",nrow(cds))

cds <- mutate(cds, CATEGORY = CAT_CROP_DAMAGE)

names(pds)
## [1] "EVTYPE"         "DAMAGE_COST"    "PERCENT_DAMAGE" "CATEGORY"
names(cds)
## [1] "EVTYPE"         "DAMAGE_COST"    "PERCENT_DAMAGE" "CATEGORY"
pcDamageComparisonDS <- rbind(pds,cds)

Results:

# Enumerate the N events based on 90th percentile rank for top fatalities
if(nrow(topNEvents) <= 20) 
{
    topNEvents[1:nrow(topNEvents),]
}else topNEvents[1:20,]
## Source: local data frame [15 x 6]
## 
##                     EVTYPE FATALITIES INJURIES CASUALTIES
## 1                  TORNADO       5633    91346      96979
## 2      EXCESSIVE HEAT WAVE       1903     6525       8428
## 3                HEAT WAVE       1114     2479       3593
## 4              FLASH FLOOD        978     1777       2755
## 5                LIGHTNING        816     5230       6046
## 6        THUNDERSTORM WIND        701     9353      10054
## 7              RIP CURRENT        572      529       1101
## 8                    FLOOD        470     6789       7259
## 9                HIGH WIND        283     1439       1722
## 10               AVALANCHE        225      170        395
## 11            WINTER STORM        206     1321       1527
## 12            EXTREME COLD        162      231        393
## 13              HEAVY SNOW        127     1021       1148
## 14 EXTREME COLD/WIND CHILL        125       24        149
## 15            STRONG WINDS        111      301        412
## Variables not shown: PERCENT_FATALITIES (dbl), CUM_PERCENT_FATALITIES
##   (dbl)
# Remaining events if the number in the result is more than 20
if(nrow(topNEvents) > 20) 
{
    topNEvents[21:nrow(topNEvents),]
}

Figure 1: Profile of fatalities and injuries for top N events ranked by fatalities in descending order.

# Figure 1: Comparison of Aggregate fatalities and injuries for top N events.
# Ranking is based on top aggregate of fatalities

# Note: The y-axis values are on logarithmic scase (to the base 10) to capture
# a high dynamic range of the fatality and injury estimates

barchart(VALUE ~ EVTYPE | CATEGORY, 
         data = topNSummaryMelt,
         main="Aggregate Fatalities and Injuries for top N Events over all Years", 
         ylab="Overall Casualties (Fatalities/Injuries/Combined)",
         index.cond=list(c(4,3,2,1)),
         xlab="Event Type",
         col=c("purple"),
         layout=c(1,4),
         scales=list(x=list(rot=45),y=list(log=10))
         #scales=list(x=list(rot=45))
)  

Comparison of Cumulative Fatalities/Injuries Per Year for Top N Events vs Rest of Significant Events:

Comparison of mean of aggregate fatalities for top N worst case events and mean of aggregate fatalities for the rest of the events (excluding the top N events)

# Mean of aggregate fatalities per year of top N severe events
meanAggregateFatalities
## [1] 216.5484
# Mean of aggregate fatalities per year of all other severe events (excluding top N events)
meanAggregateFatalities2
## [1] 61.39286

Comparison of mean of aggregate injuries for top N worst case events and mean of aggregate fatalities for the rest of the events (excluding the top N events)

# Mean of aggregate injuries per year of top N severe events
meanAggregateInjuries
## [1] 2073.145
# Mean of aggregate injuries per year of all other severe events (excluding top N events)
meanAggregateInjuries2
## [1] 428.2857

Figure 2: Comparison of Cumulative Fatalities/Injuries Per Year for Top N Events vs Rest of Events

# Figure 2
xyplot(  CUM_FATALITIES + CUM_INJURIES + MEAN_FATALITIES + MEAN_INJURIES ~ YEAR | CATEGORY, 
         data = yearlyComparisonDS,
         type = "l",
         lty = list(c(1),c(1),c(2),c(2)),
         lwd = c(2),
         ylab="Total Fatalities or Injuries Per Year", 
         xlab="Year of Event Occurrence",
         main="Overall Fatalities/Injuries Per Year for Top N vs Rest of Events",
         index.cond=list(c(2,1)),
         auto.key=list(space="top", rows=4, 
                       points=FALSE, lines=TRUE,
                       col=c("red","blue","red", "blue")),
         col=c("red","blue","red", "blue"),
         layout=c(1,2),
         scales=list(x=list(rot=45))
)

Breakdown of overall casualties by each state for the top N severe weather events:

# Breakdown of overall casualties by state for the top N events

# Top 20 states make up 82 percentile of total casualties
topNStatesByCasualties[1:20,]
## Source: local data frame [20 x 4]
## 
##    STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1     TX          18163          12.794359               12.79436
## 2     MO           9710           6.839907               19.63427
## 3     AL           9397           6.619424               26.25369
## 4     MS           7084           4.990103               31.24379
## 5     IL           6870           4.839357               36.08315
## 6     OK           6076           4.280049               40.36320
## 7     AR           6010           4.233557               44.59676
## 8     TN           5685           4.004621               48.60138
## 9     OH           5565           3.920091               52.52147
## 10    FL           5377           3.787660               56.30913
## 11    IN           5013           3.531252               59.84038
## 12    GA           4938           3.478420               63.31880
## 13    MI           4909           3.457992               66.77679
## 14    KY           3695           2.602828               69.37962
## 15    NC           3670           2.585217               71.96484
## 16    PA           3621           2.550701               74.51554
## 17    KS           3569           2.514071               77.02961
## 18    LA           3341           2.353463               79.38307
## 19    IA           2861           2.015342               81.39841
## 20    MN           2389           1.682857               83.08127
# Remaining states if the number in the result is more than 20
if(nrow(topNStatesByCasualties) > 20) 
{
    topNStatesByCasualties[21:nrow(topNStatesByCasualties),]
}
## Source: local data frame [5 x 4]
## 
##   STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1    WI           2387           1.681448               84.76272
## 2    MA           2220           1.563810               86.32653
## 3    SC           1901           1.339100               87.66563
## 4    VA           1714           1.207374               88.87300
## 5    NE           1424           1.003092               89.87609

Identify top M events for overall property damage based on 90th percentile score

# Enumerate top M states scoring 90th percentile rank for overall property damage
pdsReport
## Source: local data frame [10 x 4]
## 
##               EVTYPE      PROPDMG PERCENT_DAMAGE CUM_PERCENT_DAMAGE
## 1              FLOOD 144657709800      33.855504           33.85550
## 2  HURRICANE/TYPHOON  69305840000      16.220249           50.07575
## 3            TORNADO  56925660480      13.322808           63.39856
## 4        STORM SURGE  43323536000      10.139385           73.53795
## 5        FLASH FLOOD  16140811510       3.777575           77.31552
## 6               HAIL  15727366720       3.680813           80.99633
## 7          HURRICANE  11868319010       2.777646           83.77398
## 8  THUNDERSTORM WIND   9701591430       2.270548           86.04453
## 9     TROPICAL STORM   7703890550       1.803009           87.84754
## 10      WINTER STORM   6688497250       1.565367           89.41290

Identify top M events for overall crop damage based on 90th percentile score

# Enumerate top M states scoring 90th percentile rank for overall crop damage
cdsReport
## Source: local data frame [12 x 4]
## 
##               EVTYPE     CROPDMG PERCENT_DAMAGE CUM_PERCENT_DAMAGE
## 1            DROUGHT 13972566000      28.454935           28.45494
## 2              FLOOD  5661968450      11.530520           39.98545
## 3        RIVER FLOOD  5029459000      10.242423           50.22788
## 4          ICE STORM  5022113500      10.227464           60.45534
## 5               HAIL  3025954450       6.162314           66.61766
## 6          HURRICANE  2741910000       5.583861           72.20152
## 7  HURRICANE/TYPHOON  2607872800       5.310896           77.51241
## 8        FLASH FLOOD  1421317100       2.894492           80.40691
## 9       EXTREME COLD  1312973000       2.673851           83.08076
## 10 THUNDERSTORM WIND  1159505100       2.361316           85.44207
## 11      FROST/FREEZE  1094186000       2.228294           87.67037
## 12        HEAVY RAIN   733399800       1.493558           89.16393

Figure 3: Comparison of Overall Property Damage and Crop Damage Costs (U.S Dollars) for top M events

# Comparison of Property and Crop damages for top M events

barchart(DAMAGE_COST ~ EVTYPE | CATEGORY, 
         data = pcDamageComparisonDS,
         main="Overall Property Damage and Crop Damage Costs (U.S $) for top M events", 
         ylab="Overall Property Damage or Crop Damage Costs (U.S $) Per Event", 
         xlab="Event Type",
         index.cond=list(c(2,1)),
         col=c("red"),
         layout=c(2,1),
         scales=list(x=list(rot=45))
)