This is a RR Course Project report which identifies severe weather events based on processing and analysis of U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database. The preprocessing of the data attempts to clean up missing, redundant, and inconsistent values of event types in the source data to prevent data contamination. This report has tentative conclusions to help National or State agencies in optimal allocation of limited resources available by focussing attention to a small but dominant subset of a large collection of severe weather events, which contribute disproportionately to elevated loss of human life, injuries, and damage to property and crops. The report also attempts to explore the impact of these events in terms of geographic and temporal dominance to bring further focus on targeted allocation of resources.
The following is a list of tentative conclusions and conjectures based on the analysis:
A relatively small subset of severe weather events (15 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of loss of human life and injuries. These N=15 events appear to score a 90th percentile rank in relation to impact from all other events.
This subset of N=15 severe weather events (designated as Event Group A Top N in the report) appear to have a significantly higher contribution (yearly mean fatality, and yearly mean injury) in comparison to all other events when the events are summarized per year) spanning all the years from 1950 to 2012.
A subset of N=25 states/regions have been scored at 90th percentile rank in terms of the overall national casualties.
A different, independent subset of severe weather events (10 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of property damage measured in U.S dollars. These N=10 events are scored at 90th percentile rank in relation to property damage impact from all other events.
A different, independent subset of severe weather events (13 out of more than 220 possible unique events in the NOAA Storm database) appear to cause disproportionate extent of crop damage measured in U.S dollars. These N=13 events are scored at 90th percentile rank in relation to crop damage impact from all other events.
The subsets identified above are enumerated in the results section of the report.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dplyr)
require(lattice)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
library(lattice)
require(lubridate)
## Loading required package: lubridate
## Warning: package 'lubridate' was built under R version 3.1.3
library(lubridate)
require(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 3.1.3
library(knitr)
library(dplyr)
library(lattice)
require(stats)
library
require(stringr)
## Loading required package: stringr
library(stringr)
require(reshape2)
## Loading required package: reshape2
library(reshape2)
opts_chunk$set(echo=TRUE, cache=TRUE)
We read in the NOAA Storm datase from the raw csv file included in the zip archive. The data is a comma delimited file, and missing values are inherently coded as NA fields in the input data file. The header data is included in the reading.
repdata.data.StormData <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))
dim(repdata.data.StormData)
## [1] 902297 37
# summary(repdata.data.StormData)
The NOAA Storm dataset has a number of redundant labels for Event types. In this section, we try to clean up the data by correct labelling using cleanup code. In this section, we also attempt to reduce the dimensionality of data in the Storm dataset by looking at only the consequential events (fatalities > 0 or injuries > 0) for further analysis.
# Check for any non-numeric values in FATALITIES and INJURIES columns
sum(any(!is.numeric(repdata.data.StormData$FATALITIES)))
## [1] 0
sum(any(!is.numeric(repdata.data.StormData$INJURIES)))
## [1] 0
# Across the United States, which types of events (as indicated in the EVTYPE variable)
# are most harmful with respect to population health?
mostHarmfulEventDS <- filter(repdata.data.StormData,FATALITIES > 0 | INJURIES > 0)
# We realize this consequential filter reduces the dimensionality of the dataset
# significantly from 902297 rows to 21929 rows, which is approximately 98% reduction
dim(mostHarmfulEventDS)
## [1] 21929 37
# Introduce a new variable to sum the fatalities and injuries for each observation
CASUALTIES <- mostHarmfulEventDS$FATALITIES + mostHarmfulEventDS$INJURIES
casualtyDS <- select(mostHarmfulEventDS, EVTYPE, STATE, BGN_DATE, FATALITIES, INJURIES)
casualtyDS <- mutate(casualtyDS, CASUALTIES)
# Look for data anomalies and inconsistencies in EVTYPE variable in the new dataset
events <- select(casualtyDS,EVTYPE)
uniqEvents <- unique(events)
dim(uniqEvents)
## [1] 220 1
# Notice the events boil down to 220. A quick visual inspection of the unique events
# shows the EVTTYPE variable has a few noticeable classification errors. FOr example,
# the observation with EVTYPE = "HIGH" is ambiguous, and needs removal
#sort(uniqEvents$EVTYPE)
filter(casualtyDS,EVTYPE == "HIGH")
## EVTYPE STATE BGN_DATE FATALITIES INJURIES CASUALTIES
## 1 HIGH CA 11/16/1994 0:00:00 0 1 1
casualtyDS <- filter(casualtyDS, EVTYPE != "HIGH")
casualtyDS$EVTYPE <- gsub("AVALANCE","AVALANCHE",casualtyDS$EVTYPE)
#casualtyDS2 <- casualtyDS
casualtyDS$EVTYPE <- unlist(lapply(casualtyDS$EVTYPE,function(x){toupper(x)}))
dim(casualtyDS)
## [1] 21928 6
casualtyDS$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("HEAT$","HEAT WAVE",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("WINDS$","WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",casualtyDS$EVTYPE)
casualtyDS$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",casualtyDS$EVTYPE)
Summarization of the dataset by EVTYPE and ranking of the data to get the dominant subset of events based on ranking the events using a score of 90th percentile ranking.
# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
casualtySummary <- summarize(group_by(casualtyDS, EVTYPE), SUM_CASUALTIES=sum(CASUALTIES), SUM_FATALITIES=sum(FATALITIES), SUM_INJURIES=sum(INJURIES))
# To extract the severest of the casualties, apply filter: fatalities > 0 and injuries > 0
# to the summary observations
#
severeCasualtySummary <- filter(casualtySummary, SUM_FATALITIES > 0 & SUM_INJURIES > 0)
severeCasualtySummary <- select(severeCasualtySummary, EVTYPE, FATALITIES=SUM_FATALITIES, INJURIES=SUM_INJURIES, CASUALTIES=SUM_CASUALTIES)
# Rank casuality summary by highest to lowest fatalities
severeCasualtySummary <- arrange(severeCasualtySummary, desc(FATALITIES))
PERCENT_FATALITIES <- (100.0 * severeCasualtySummary$FATALITIES/sum(severeCasualtySummary$FATALITIES,na.rm=TRUE))
#View(PERCENT_FATALITIES)
severeCasualtySummary <- mutate(severeCasualtySummary,PERCENT_FATALITIES)
CUM_PERCENT_FATALITIES <- 1:length(PERCENT_FATALITIES)
for(i in 1:length(PERCENT_FATALITIES)) {
sum <- 0
for(j in 1:i) {
sum <- sum + PERCENT_FATALITIES[j]
}
CUM_PERCENT_FATALITIES[i] = sum
}
severeCasualtySummary = mutate(severeCasualtySummary,CUM_PERCENT_FATALITIES)
names(severeCasualtySummary)
## [1] "EVTYPE" "FATALITIES"
## [3] "INJURIES" "CASUALTIES"
## [5] "PERCENT_FATALITIES" "CUM_PERCENT_FATALITIES"
For data visualization and easy grasp of the relative magnitude of each event, take the top N events which contribute to 90th percentile ranking of overall fatalities over all years across United States. Use the 90th percentile rank to arrive at the dominant severe weather events.
topNEvents <- filter(severeCasualtySummary, CUM_PERCENT_FATALITIES < 90.1)
#topNEvents <- severeCasualtySummary[1:20,]
# Enumerate the top N events scoring 90th percentile score in descending order
topNEvents$EVTYPE
## [1] "TORNADO" "EXCESSIVE HEAT WAVE"
## [3] "HEAT WAVE" "FLASH FLOOD"
## [5] "LIGHTNING" "THUNDERSTORM WIND"
## [7] "RIP CURRENT" "FLOOD"
## [9] "HIGH WIND" "AVALANCHE"
## [11] "WINTER STORM" "EXTREME COLD"
## [13] "HEAVY SNOW" "EXTREME COLD/WIND CHILL"
## [15] "STRONG WINDS"
topNSummaryMelt <- melt(topNEvents,id.vars=c("EVTYPE"),meas.vars=c("FATALITIES","INJURIES","CASUALTIES","PERCENT_FATALITIES"))
names(topNSummaryMelt) <- c("EVTYPE","CATEGORY","VALUE")
Next, we explore how the temporal impact of these top N events ranked by top fatalities is distributed over time across all states in U.S.
casualtyDS <- mutate(casualtyDS,YEAR=year(parse_date_time(BGN_DATE,"%m%d%Y %H%M%S")))
names(casualtyDS)
## [1] "EVTYPE" "STATE" "BGN_DATE" "FATALITIES" "INJURIES"
## [6] "CASUALTIES" "YEAR"
#View(head(casualtyDS,1000))
casualtySummary2 <- summarize(group_by(casualtyDS, EVTYPE, YEAR), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES))
dim(casualtySummary2)
## [1] 808 4
#topNEvents$EVTYPE
topNEventsSummary2 <- filter(casualtySummary2,EVTYPE %in% topNEvents$EVTYPE)
#View(topNEventsSummary2)
topNEventsSummaryByYear <- summarize(group_by(topNEventsSummary2, YEAR), CUM_FATALITIES=sum(FATALITIES,na.rm=TRUE), CUM_INJURIES=sum(INJURIES, na.rm=TRUE))
meanAggregateFatalities <- mean(topNEventsSummaryByYear$CUM_FATALITIES,na.rm=TRUE)
MEAN_FATALITIES <- rep(meanAggregateFatalities, nrow(topNEventsSummaryByYear))
length(MEAN_FATALITIES)
## [1] 62
meanAggregateInjuries <- mean(topNEventsSummaryByYear$CUM_INJURIES,na.rm=TRUE)
MEAN_INJURIES <- rep(meanAggregateInjuries, nrow(topNEventsSummaryByYear))
topNEventsSummaryByYear <- mutate(topNEventsSummaryByYear, MEAN_FATALITIES, MEAN_INJURIES)
otherEventsSummary <- filter(casualtySummary2,!EVTYPE %in% topNEvents$EVTYPE)
otherEventsSummaryByYear <- summarize(group_by(otherEventsSummary, YEAR), CUM_FATALITIES=sum(FATALITIES,na.rm=TRUE), CUM_INJURIES=sum(INJURIES, na.rm=TRUE))
meanAggregateFatalities2 <- mean(otherEventsSummaryByYear$CUM_FATALITIES,na.rm=TRUE)
MEAN_FATALITIES2 <- rep(meanAggregateFatalities2, nrow(otherEventsSummaryByYear))
length(MEAN_FATALITIES2)
## [1] 28
meanAggregateInjuries2 <- mean(otherEventsSummaryByYear$CUM_INJURIES,na.rm=TRUE)
MEAN_INJURIES2 <- rep(meanAggregateInjuries2, nrow(otherEventsSummaryByYear))
length(MEAN_INJURIES2)
## [1] 28
# Mean of aggregate fatalities per year of top N severe events
meanAggregateFatalities
## [1] 216.5484
# Mean of aggregate fatalities per year of all other severe events (excluding top N events)
meanAggregateFatalities2
## [1] 61.39286
# Mean of aggregate injuries per year of top N severe events
meanAggregateInjuries
## [1] 2073.145
# Mean of aggregate injuries per year of all other severe events (excluding top N events)
meanAggregateInjuries2
## [1] 428.2857
otherEventsSummaryByYear <- mutate(otherEventsSummaryByYear, MEAN_FATALITIES2, MEAN_INJURIES2)
differentiator1 <- as.factor(rep("TOP_N", nrow(topNEventsSummaryByYear)))
topNEventsSummaryByYear <- mutate(topNEventsSummaryByYear, differentiator1)
differentiator2 <- as.factor(rep("OTHER_THAN_TOP_N", nrow(otherEventsSummaryByYear)))
otherEventsSummaryByYear <- mutate(otherEventsSummaryByYear, differentiator2)
names(topNEventsSummaryByYear) <- c("YEAR","CUM_FATALITIES","CUM_INJURIES","MEAN_FATALITIES","MEAN_INJURIES","CATEGORY")
names(otherEventsSummaryByYear) <- c("YEAR","CUM_FATALITIES","CUM_INJURIES","MEAN_FATALITIES","MEAN_INJURIES","CATEGORY")
yearlyComparisonDS <- rbind(topNEventsSummaryByYear, otherEventsSummaryByYear)
yearlyComparisonDS$CATEGORY <- gsub("TOP_N$","EVENT_GROUP_A_TOP_N",yearlyComparisonDS$CATEGORY)
yearlyComparisonDS$CATEGORY <- gsub("OTHER_THAN_EVENT_GROUP_A_TOP_N$","EVENT_GROUP_B_OTHERS",yearlyComparisonDS$CATEGORY)
Next we explore the regional impact of the top N severe weather events to identify top M states which take the brunt of the fatility and injury impact. Let us use the 90th percentile score to get the top M states for summarization
casualtySummary3 <- summarize(group_by(casualtyDS, EVTYPE, STATE), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES))
dim(casualtySummary3)
## [1] 1304 4
names(casualtySummary3)
## [1] "EVTYPE" "STATE" "FATALITIES" "INJURIES"
topNEventsSummary3 <- filter(casualtySummary3,EVTYPE %in% topNEvents$EVTYPE)
topNEventsSummary3 <- mutate(topNEventsSummary3, CASUALTIES = FATALITIES + INJURIES)
#View(topNEventsSummary3)
topNEventsSummaryByState <- summarize(group_by(topNEventsSummary3, STATE), CUM_CASUALTIES=sum(CASUALTIES))
names(topNEventsSummaryByState)
## [1] "STATE" "CUM_CASUALTIES"
topNEventsSummaryByState <- arrange(topNEventsSummaryByState, desc(CUM_CASUALTIES))
#### Rank casuality summary by state from highest to lowest total casualties
PERCENT_CASUALTIES <- (100.0 * topNEventsSummaryByState$CUM_CASUALTIES/sum(topNEventsSummaryByState$CUM_CASUALTIES))
#View(PERCENT_CASUALTIES)
topNEventsSummaryByState <- mutate(topNEventsSummaryByState,PERCENT_CASUALTIES)
CUM_PERCENT_CASUALTIES <- 1:length(PERCENT_CASUALTIES)
for(i in 1:length(PERCENT_CASUALTIES)) {
sum <- 0
for(j in 1:i) {
sum <- sum + PERCENT_CASUALTIES[j]
}
CUM_PERCENT_CASUALTIES[i] = sum
}
#View(CUM_PERCENT_CASUALTIES)
topNEventsSummaryByState = mutate(topNEventsSummaryByState, CUM_PERCENT_CASUALTIES)
# identify top N states covering 90 percentile of overall casualties (fatalities + injuries)
topNStatesByCasualties <- filter(topNEventsSummaryByState, CUM_PERCENT_CASUALTIES < 90.1)
# topNStatesByCasualties
# Top 20 states make up 82 percentile of total casualties
topNStatesByCasualties[1:20,]
## Source: local data frame [20 x 4]
##
## STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1 TX 18163 12.794359 12.79436
## 2 MO 9710 6.839907 19.63427
## 3 AL 9397 6.619424 26.25369
## 4 MS 7084 4.990103 31.24379
## 5 IL 6870 4.839357 36.08315
## 6 OK 6076 4.280049 40.36320
## 7 AR 6010 4.233557 44.59676
## 8 TN 5685 4.004621 48.60138
## 9 OH 5565 3.920091 52.52147
## 10 FL 5377 3.787660 56.30913
## 11 IN 5013 3.531252 59.84038
## 12 GA 4938 3.478420 63.31880
## 13 MI 4909 3.457992 66.77679
## 14 KY 3695 2.602828 69.37962
## 15 NC 3670 2.585217 71.96484
## 16 PA 3621 2.550701 74.51554
## 17 KS 3569 2.514071 77.02961
## 18 LA 3341 2.353463 79.38307
## 19 IA 2861 2.015342 81.39841
## 20 MN 2389 1.682857 83.08127
# Remaining states if the number in the result is more than 20
if(nrow(topNStatesByCasualties) > 20)
{
topNStatesByCasualties[21:nrow(topNStatesByCasualties),]
}
## Source: local data frame [5 x 4]
##
## STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1 WI 2387 1.681448 84.76272
## 2 MA 2220 1.563810 86.32653
## 3 SC 1901 1.339100 87.66563
## 4 VA 1714 1.207374 88.87300
## 5 NE 1424 1.003092 89.87609
sum(any(is.na(repdata.data.StormData$PROPDMG)))
## [1] 0
sum(any(is.na(repdata.data.StormData$PROPDMGEXP)))
## [1] 0
unique(repdata.data.StormData$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
summarize(group_by(repdata.data.StormData,PROPDMGEXP),PROPEXP_COUNT=n())
## Source: local data frame [19 x 2]
##
## PROPDMGEXP PROPEXP_COUNT
## 1 465934
## 2 - 1
## 3 ? 8
## 4 + 5
## 5 0 216
## 6 1 25
## 7 2 13
## 8 3 4
## 9 4 4
## 10 5 28
## 11 6 4
## 12 7 5
## 13 8 1
## 14 B 40
## 15 h 1
## 16 H 6
## 17 K 424665
## 18 m 7
## 19 M 11330
propertyDamageEvents <- filter(repdata.data.StormData, PROPDMG > 0 & PROPDMGEXP %in% c("K","M","B"))
dim(propertyDamageEvents)
## [1] 238840 37
sum(any(is.na(repdata.data.StormData$CROPDMG)))
## [1] 0
sum(any(is.na(repdata.data.StormData$CROPDMGEXP)))
## [1] 0
sum(any(!is.numeric(repdata.data.StormData$CROPDMG)))
## [1] 0
unique(repdata.data.StormData$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
summarize(group_by(repdata.data.StormData,CROPDMGEXP),CROPEXP_COUNT=n())
## Source: local data frame [9 x 2]
##
## CROPDMGEXP CROPEXP_COUNT
## 1 618413
## 2 ? 7
## 3 0 19
## 4 2 1
## 5 B 9
## 6 k 21
## 7 K 281832
## 8 m 1
## 9 M 1994
cropDamageEvents <- filter(repdata.data.StormData, CROPDMG > 0 & CROPDMGEXP != "" & CROPDMGEXP %in% c("b","B","k","K","m","M"))
cropDamageEvents$CROPDMGEXP <-gsub("k","K",cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <-gsub("m","M",cropDamageEvents$CROPDMGEXP)
summarize(group_by(cropDamageEvents,CROPDMGEXP),CROPEXP_COUNT=n())
## Source: local data frame [3 x 2]
##
## CROPDMGEXP CROPEXP_COUNT
## 1 B 7
## 2 K 20158
## 3 M 1919
propertyDamageEvents$PROPDMGEXP <- gsub("K","1000", propertyDamageEvents$PROPDMGEXP)
propertyDamageEvents$PROPDMGEXP <- gsub("M","1000000", propertyDamageEvents$PROPDMGEXP)
propertyDamageEvents$PROPDMGEXP <- gsub("B","1000000000", propertyDamageEvents$PROPDMGEXP)
cropDamageEvents$CROPDMGEXP <- gsub("K","1000", cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <- gsub("M","1000000", cropDamageEvents$CROPDMGEXP)
cropDamageEvents$CROPDMGEXP <- gsub("B","1000000000", cropDamageEvents$CROPDMGEXP)
propertyDamageEvents$PROPDMG <- propertyDamageEvents$PROPDMG * as.numeric(propertyDamageEvents$PROPDMGEXP)
cropDamageEvents$CROPDMG <- cropDamageEvents$CROPDMG * as.numeric(cropDamageEvents$CROPDMGEXP)
#View(propertyDamageEvents)
#View(cropDamageEvents)
propertyDamageEvents <- filter(propertyDamageEvents, EVTYPE != "HIGH")
propertyDamageEvents$EVTYPE <- gsub("AVALANCE","AVALANCHE",propertyDamageEvents$EVTYPE)
#propertyDamageEvents2 <- propertyDamageEvents
propertyDamageEvents$EVTYPE <- unlist(lapply(propertyDamageEvents$EVTYPE,function(x){toupper(x)}))
dim(propertyDamageEvents)
## [1] 238840 37
propertyDamageEvents$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("HEAT$","HEAT WAVE",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("WINDS$","WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",propertyDamageEvents$EVTYPE)
propertyDamageEvents$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",propertyDamageEvents$EVTYPE)
Let us rank the severe weather events for overall property damage based on 90th percentile score.
# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
propertyDamageSummary <- summarize(group_by(propertyDamageEvents, EVTYPE), PROPDMG=sum(PROPDMG))
#View(propertyDamageSummary)
propertyDamageSummary <- arrange(propertyDamageSummary,desc(PROPDMG))
PERCENT_DAMAGE <- (100.0 * propertyDamageSummary$PROPDMG/sum(propertyDamageSummary$PROPDMG,na.rm=TRUE))
#View(PERCENT_DAMAGE)
propertyDamageSummary <- mutate(propertyDamageSummary,PERCENT_DAMAGE)
CUM_PERCENT_DAMAGE <- 1:length(PERCENT_DAMAGE)
for(i in 1:length(PERCENT_DAMAGE)) {
sum <- 0
for(j in 1:i) {
sum <- sum + PERCENT_DAMAGE[j]
}
CUM_PERCENT_DAMAGE[i] = sum
}
names(propertyDamageSummary)
## [1] "EVTYPE" "PROPDMG" "PERCENT_DAMAGE"
dim(propertyDamageSummary)
## [1] 354 3
propertyDamageSummary <- mutate(propertyDamageSummary, CUM_PERCENT_DAMAGE)
pds <- filter(propertyDamageSummary, CUM_PERCENT_DAMAGE < 90.1)
# Enumerate top N states which cover 90 percentile of the overall property damage
pds$EVTYPE
## [1] "FLOOD" "HURRICANE/TYPHOON" "TORNADO"
## [4] "STORM SURGE" "FLASH FLOOD" "HAIL"
## [7] "HURRICANE" "THUNDERSTORM WIND" "TROPICAL STORM"
## [10] "WINTER STORM"
pdsReport <- pds
pds <- select(pds, -CUM_PERCENT_DAMAGE)
Similar data cleaning of the NOAA dataset for ranking severe weather events for crop damage.
cropDamageEvents <- filter(cropDamageEvents, EVTYPE != "HIGH")
cropDamageEvents$EVTYPE <- gsub("AVALANCE","AVALANCHE",cropDamageEvents$EVTYPE)
#cropDamageEvents2 <- cropDamageEvents
cropDamageEvents$EVTYPE <- unlist(lapply(cropDamageEvents$EVTYPE,function(x){toupper(x)}))
dim(cropDamageEvents)
## [1] 22084 37
cropDamageEvents$EVTYPE <- gsub("COASTALSTORM$","COASTAL STORM",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("GUSTY WIND$","GUSTY WINDS",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("HEAT WAVES$","HEAT WAVE",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("HEAT$","HEAT WAVE",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("HIGH WINDS$","HIGH WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("LANDSLIDES$","LANDSLIDE",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("RIP CURRENTS$","RIP CURRENT",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("WILD FIRES$","WILDFIRE",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("WINDS$","WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("TSTM WIND$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("THUNDERSTORMS WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("THUNDERSTORM WINDS$","THUNDERSTORM WIND",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("STRONG WIND$","STRONG WINDS",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("TSTM WIND/HAIL$","THUNDERSTORM WINDS/HAIL",cropDamageEvents$EVTYPE)
cropDamageEvents$EVTYPE <- gsub("WATERSPOUT TORNADO$","WATERSPOUT/TORNADO",cropDamageEvents$EVTYPE)
Let us rank the severe weather events for overall crop damage based on 90th percentile score.
# Summarize fatalities, injuries, and overall casualties of the new subset by EVTTYPE
cropDamageSummary <- summarize(group_by(cropDamageEvents, EVTYPE), CROPDMG=sum(CROPDMG))
cropDamageSummary <- arrange(cropDamageSummary,desc(CROPDMG))
PERCENT_DAMAGE <- (100.0 * cropDamageSummary$CROPDMG/sum(cropDamageSummary$CROPDMG,na.rm=TRUE))
# View(PERCENT_DAMAGE)
cropDamageSummary <- mutate(cropDamageSummary,PERCENT_DAMAGE)
CUM_PERCENT_DAMAGE <- 1:length(PERCENT_DAMAGE)
for(i in 1:length(PERCENT_DAMAGE)) {
sum <- 0
for(j in 1:i) {
sum <- sum + PERCENT_DAMAGE[j]
}
CUM_PERCENT_DAMAGE[i] = sum
}
names(cropDamageSummary)
## [1] "EVTYPE" "CROPDMG" "PERCENT_DAMAGE"
cropDamageSummary <- mutate(cropDamageSummary, CUM_PERCENT_DAMAGE)
# Enumerate top N states which cover 90 percentile of the overall property damage
cds <- filter(cropDamageSummary, CUM_PERCENT_DAMAGE < 90.1)
cds$EVTYPE
## [1] "DROUGHT" "FLOOD" "RIVER FLOOD"
## [4] "ICE STORM" "HAIL" "HURRICANE"
## [7] "HURRICANE/TYPHOON" "FLASH FLOOD" "EXTREME COLD"
## [10] "THUNDERSTORM WIND" "FROST/FREEZE" "HEAVY RAIN"
cdsReport <- cds
cds <- select(cds, -CUM_PERCENT_DAMAGE)
Combine top M events for property damage and top M events for crop damage by row binding and add a new column for category type which enables to combine the two sets of events into a single xyplot with two panels.
dim(cropDamageSummary)
## [1] 121 4
dim(propertyDamageSummary)
## [1] 354 4
names(pds) <- c("EVTYPE","DAMAGE_COST","PERCENT_DAMAGE")
names(cds) <- c("EVTYPE","DAMAGE_COST","PERCENT_DAMAGE")
CAT_PROP_DAMAGE <- rep("PROPERTY_DAMAGE",nrow(pds))
pds <- mutate(pds, CATEGORY = CAT_PROP_DAMAGE)
CAT_CROP_DAMAGE <- rep("CROP_DAMAGE",nrow(cds))
cds <- mutate(cds, CATEGORY = CAT_CROP_DAMAGE)
names(pds)
## [1] "EVTYPE" "DAMAGE_COST" "PERCENT_DAMAGE" "CATEGORY"
names(cds)
## [1] "EVTYPE" "DAMAGE_COST" "PERCENT_DAMAGE" "CATEGORY"
pcDamageComparisonDS <- rbind(pds,cds)
# Enumerate the N events based on 90th percentile rank for top fatalities
if(nrow(topNEvents) <= 20)
{
topNEvents[1:nrow(topNEvents),]
}else topNEvents[1:20,]
## Source: local data frame [15 x 6]
##
## EVTYPE FATALITIES INJURIES CASUALTIES
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT WAVE 1903 6525 8428
## 3 HEAT WAVE 1114 2479 3593
## 4 FLASH FLOOD 978 1777 2755
## 5 LIGHTNING 816 5230 6046
## 6 THUNDERSTORM WIND 701 9353 10054
## 7 RIP CURRENT 572 529 1101
## 8 FLOOD 470 6789 7259
## 9 HIGH WIND 283 1439 1722
## 10 AVALANCHE 225 170 395
## 11 WINTER STORM 206 1321 1527
## 12 EXTREME COLD 162 231 393
## 13 HEAVY SNOW 127 1021 1148
## 14 EXTREME COLD/WIND CHILL 125 24 149
## 15 STRONG WINDS 111 301 412
## Variables not shown: PERCENT_FATALITIES (dbl), CUM_PERCENT_FATALITIES
## (dbl)
# Remaining events if the number in the result is more than 20
if(nrow(topNEvents) > 20)
{
topNEvents[21:nrow(topNEvents),]
}
Figure 1: Profile of fatalities and injuries for top N events ranked by fatalities in descending order.
# Figure 1: Comparison of Aggregate fatalities and injuries for top N events.
# Ranking is based on top aggregate of fatalities
# Note: The y-axis values are on logarithmic scase (to the base 10) to capture
# a high dynamic range of the fatality and injury estimates
barchart(VALUE ~ EVTYPE | CATEGORY,
data = topNSummaryMelt,
main="Aggregate Fatalities and Injuries for top N Events over all Years",
ylab="Overall Casualties (Fatalities/Injuries/Combined)",
index.cond=list(c(4,3,2,1)),
xlab="Event Type",
col=c("purple"),
layout=c(1,4),
scales=list(x=list(rot=45),y=list(log=10))
#scales=list(x=list(rot=45))
)
Comparison of Cumulative Fatalities/Injuries Per Year for Top N Events vs Rest of Significant Events:
Comparison of mean of aggregate fatalities for top N worst case events and mean of aggregate fatalities for the rest of the events (excluding the top N events)
# Mean of aggregate fatalities per year of top N severe events
meanAggregateFatalities
## [1] 216.5484
# Mean of aggregate fatalities per year of all other severe events (excluding top N events)
meanAggregateFatalities2
## [1] 61.39286
Comparison of mean of aggregate injuries for top N worst case events and mean of aggregate fatalities for the rest of the events (excluding the top N events)
# Mean of aggregate injuries per year of top N severe events
meanAggregateInjuries
## [1] 2073.145
# Mean of aggregate injuries per year of all other severe events (excluding top N events)
meanAggregateInjuries2
## [1] 428.2857
Figure 2: Comparison of Cumulative Fatalities/Injuries Per Year for Top N Events vs Rest of Events
# Figure 2
xyplot( CUM_FATALITIES + CUM_INJURIES + MEAN_FATALITIES + MEAN_INJURIES ~ YEAR | CATEGORY,
data = yearlyComparisonDS,
type = "l",
lty = list(c(1),c(1),c(2),c(2)),
lwd = c(2),
ylab="Total Fatalities or Injuries Per Year",
xlab="Year of Event Occurrence",
main="Overall Fatalities/Injuries Per Year for Top N vs Rest of Events",
index.cond=list(c(2,1)),
auto.key=list(space="top", rows=4,
points=FALSE, lines=TRUE,
col=c("red","blue","red", "blue")),
col=c("red","blue","red", "blue"),
layout=c(1,2),
scales=list(x=list(rot=45))
)
Breakdown of overall casualties by each state for the top N severe weather events:
# Breakdown of overall casualties by state for the top N events
# Top 20 states make up 82 percentile of total casualties
topNStatesByCasualties[1:20,]
## Source: local data frame [20 x 4]
##
## STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1 TX 18163 12.794359 12.79436
## 2 MO 9710 6.839907 19.63427
## 3 AL 9397 6.619424 26.25369
## 4 MS 7084 4.990103 31.24379
## 5 IL 6870 4.839357 36.08315
## 6 OK 6076 4.280049 40.36320
## 7 AR 6010 4.233557 44.59676
## 8 TN 5685 4.004621 48.60138
## 9 OH 5565 3.920091 52.52147
## 10 FL 5377 3.787660 56.30913
## 11 IN 5013 3.531252 59.84038
## 12 GA 4938 3.478420 63.31880
## 13 MI 4909 3.457992 66.77679
## 14 KY 3695 2.602828 69.37962
## 15 NC 3670 2.585217 71.96484
## 16 PA 3621 2.550701 74.51554
## 17 KS 3569 2.514071 77.02961
## 18 LA 3341 2.353463 79.38307
## 19 IA 2861 2.015342 81.39841
## 20 MN 2389 1.682857 83.08127
# Remaining states if the number in the result is more than 20
if(nrow(topNStatesByCasualties) > 20)
{
topNStatesByCasualties[21:nrow(topNStatesByCasualties),]
}
## Source: local data frame [5 x 4]
##
## STATE CUM_CASUALTIES PERCENT_CASUALTIES CUM_PERCENT_CASUALTIES
## 1 WI 2387 1.681448 84.76272
## 2 MA 2220 1.563810 86.32653
## 3 SC 1901 1.339100 87.66563
## 4 VA 1714 1.207374 88.87300
## 5 NE 1424 1.003092 89.87609
Identify top M events for overall property damage based on 90th percentile score
# Enumerate top M states scoring 90th percentile rank for overall property damage
pdsReport
## Source: local data frame [10 x 4]
##
## EVTYPE PROPDMG PERCENT_DAMAGE CUM_PERCENT_DAMAGE
## 1 FLOOD 144657709800 33.855504 33.85550
## 2 HURRICANE/TYPHOON 69305840000 16.220249 50.07575
## 3 TORNADO 56925660480 13.322808 63.39856
## 4 STORM SURGE 43323536000 10.139385 73.53795
## 5 FLASH FLOOD 16140811510 3.777575 77.31552
## 6 HAIL 15727366720 3.680813 80.99633
## 7 HURRICANE 11868319010 2.777646 83.77398
## 8 THUNDERSTORM WIND 9701591430 2.270548 86.04453
## 9 TROPICAL STORM 7703890550 1.803009 87.84754
## 10 WINTER STORM 6688497250 1.565367 89.41290
Identify top M events for overall crop damage based on 90th percentile score
# Enumerate top M states scoring 90th percentile rank for overall crop damage
cdsReport
## Source: local data frame [12 x 4]
##
## EVTYPE CROPDMG PERCENT_DAMAGE CUM_PERCENT_DAMAGE
## 1 DROUGHT 13972566000 28.454935 28.45494
## 2 FLOOD 5661968450 11.530520 39.98545
## 3 RIVER FLOOD 5029459000 10.242423 50.22788
## 4 ICE STORM 5022113500 10.227464 60.45534
## 5 HAIL 3025954450 6.162314 66.61766
## 6 HURRICANE 2741910000 5.583861 72.20152
## 7 HURRICANE/TYPHOON 2607872800 5.310896 77.51241
## 8 FLASH FLOOD 1421317100 2.894492 80.40691
## 9 EXTREME COLD 1312973000 2.673851 83.08076
## 10 THUNDERSTORM WIND 1159505100 2.361316 85.44207
## 11 FROST/FREEZE 1094186000 2.228294 87.67037
## 12 HEAVY RAIN 733399800 1.493558 89.16393
Figure 3: Comparison of Overall Property Damage and Crop Damage Costs (U.S Dollars) for top M events
# Comparison of Property and Crop damages for top M events
barchart(DAMAGE_COST ~ EVTYPE | CATEGORY,
data = pcDamageComparisonDS,
main="Overall Property Damage and Crop Damage Costs (U.S $) for top M events",
ylab="Overall Property Damage or Crop Damage Costs (U.S $) Per Event",
xlab="Event Type",
index.cond=list(c(2,1)),
col=c("red"),
layout=c(2,1),
scales=list(x=list(rot=45))
)