An analysis is done on the database of the U.S. National Oceanic and Atmospheric Administration (NOAA) to determine the top severe weather events that have the most impact in terms of fatalities, injuries and property damage. Although the data is from 1950 to 2011, only records from 1996 onwards were considered since the tracking of the 48 different weather event types were only observed during this period. The data was tidied by reclassifying the duplicate event type entries. The value for the property damages were calculated by conversion of the exponent variables. The topmost severe weather event that has the highest number fatalities is excessive heat (~1800). Tornado is the leading cause of total incidents: fatalities and injuries combined (~22,000). Flood is the top event that has the most economic impact with almost $150 billion of property damage.
dataFileName <- "./repdata-data-StormData.csv.bz2"
if (!file.exists(dataFileName)){
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url = url, destfile = dataFileName,method = "curl")
}
if(!exists("stormData")){
stormData <- read.csv("repdata-data-StormData.csv.bz2")
}
library(dplyr)
library(lubridate)
library(ggplot2)
dim(stormData)
## [1] 902297 37
colnames(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
A quick review of the data shows that there are 902297 observation and 37 variables. Not all the avariables are needed for the analyis so we select only variables that provide information for the health and economic impact :BGN_DATE,EVTYPE,FATALITIES, INJURIES, PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP.
stormDataRed <- stormData[,c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
At this point we only select the entries that were recorded after 1995.
data1 <- mutate(stormDataRed, BGN_DATE=mdy_hms(BGN_DATE))
data1 <- filter(data1, year(BGN_DATE)> 1995)
dim(data1)
## [1] 653530 8
We only select the observations that resulted to any of the following : fatality, injury, crop damage or property damage
data1 <- data1[data1$FATALITIES > 0 | data1$INJURIES > 0 | data1$CROPDMG > 0 | data1$PROPDMG > 0,]
dim(data1)
## [1] 201318 8
We have narrowed down our dataset to 201318 observations.
A quick look at the factors for the event types reveals 222 unique entries. This is far greater than the standard entries of 48 as indicated here Different weather types
We therefore need to tidy the entries for event type. By just transforming the event entries to lower case we have reduced the unique values from 222 to 186
length(unique(data1$EVTYPE))
## [1] 222
data1$EVTYPE <- tolower(data1$EVTYPE)
length(unique(data1$EVTYPE))
## [1] 186
A check of the top 50 most frequently occuring entries show duplication such as tstm wind and thunderstorm wind and rip current and rip currents.
data1$EVTYPE <- factor(data1$EVTYPE)
uniqueEvtype <- sort(table(data1$EVTYPE), decreasing=TRUE)
uniqueEvtype[1:50]
##
## tstm wind thunderstorm wind hail
## 61776 43097 22679
## flash flood tornado lightning
## 19011 12366 11152
## flood high wind strong wind
## 9513 5402 3369
## winter storm heavy rain heavy snow
## 1460 1047 1029
## wildfire urban/sml stream fld excessive heat
## 847 702 685
## ice storm tstm wind/hail tropical storm
## 631 441 410
## winter weather wild/forest fire rip current
## 405 381 364
## avalanche drought rip currents
## 264 258 239
## blizzard lake-effect snow landslide
## 228 194 190
## storm surge extreme cold heat
## 169 166 164
## coastal flood light snow winter weather/mix
## 153 141 139
## high surf hurricane frost/freeze
## 128 126 117
## extreme cold/wind chill marine tstm wind fog
## 111 109 101
## dust storm cold/wind chill dust devil
## 96 90 84
## river flood dry microburst hurricane/typhoon
## 80 75 72
## wind dense fog heavy surf/high surf
## 67 58 50
## storm surge/tide marine strong wind
## 47 46
The top 50 in terms ofmost frequently occurring event types were analyzed. Entries that seemed intuitively belonging to a standard event type entry are appropriately assigned. Others that could not be intuitively assigned are labeled as unclassified. White spaces were also removed.
tidytstm <- function(x) gsub("tstm", "thunderstorm", x)
removews <- function(x) gsub("^\\s+|\\s+$", "", x)
tidywinds <- function(x) {y <- gsub(" winds", " wind", x)
gsub("winds", "wind", y)
}
tidyothers<-function(x) gsub("rip currents","rip current",x )
data1<-mutate(data1, EVTYPE=removews(EVTYPE), EVTYPE = tidytstm(EVTYPE), EVTYPE = tidywinds(EVTYPE), EVTYPE=tidyothers(EVTYPE))
data1$EVTYPE <- factor(data1$EVTYPE)
coastalflood <- c("coastal flood","coastal flooding")
densefog <- c("dense fog","fog")
fire <-c("wild/forest fire","wildfire")
flood <-c("urban/sml stream fld","fld")
frostfreeze <-c("frost","freeze")
hail <- c("small hail","hail")
heavysnow <-c("excessive snow","heavy snow","snow")
highwind <-c("gusty wind","high wind")
hurricane <-c("hurricane","typhoon","hurricane/typhoon")
surf <-c("heavy surf/high surf","heavy surf")
thunderstorm <- c("thunderstorm wind (g45)", "thunderstorm wind/hail","thunderstorm wind (g40)")
unclassified <-c("landslide","dry microburst","light snow","glaze","icy roads", "mixed precipitation","gradient wind","freezing drizzle","mudslide","coastal storm","freezing rain","light freezing rain","other","fog","river flood","wind")
windchill <-c("extreme windchill","cold/wind chill", "extreme cold/wind chill","cold","extreme cold")
winter <-c("winter weather/mix","winter weather")
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% thunderstorm] <- "thunderstorm wind"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% coastalflood] <- "coastal flood"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% flood] <- "flood"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% frostfreeze] <- "frost/freeze"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% hail] <- "hail"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% heavysnow] <- "heavy snow"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% fire] <- "wildfire"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% windchill] <- "extreme cold/wind chill"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% winter] <- "winter weather"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% hurricane] <- "hurricane/typhoon"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% surf] <- "high surf"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% highwind] <- "high wind"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% unclassified] <- "unclassified"
unique.evtype <- sort(table(data1$EVTYPE), decreasing=TRUE)
unique.evtype[1:30]
##
## thunderstorm wind hail flash flood
## 105363 22690 19012
## tornado lightning flood
## 12366 11152 10215
## high wind strong wind winter storm
## 5446 3414 1460
## wildfire heavy snow heavy rain
## 1228 1070 1047
## unclassified excessive heat ice storm
## 792 685 631
## rip current winter weather tropical storm
## 603 544 410
## extreme cold/wind chill avalanche drought
## 406 264 258
## blizzard high surf hurricane/typhoon
## 228 207 207
## lake-effect snow coastal flood storm surge
## 194 188 169
## heat marine thunderstorm wind frost/freeze
## 164 142 132
length(unique.evtype)
## [1] 139
After the series of cleansing we still have 139 unique entries (still over the expected 48 standard entries). However the top 30th event entry only has a count of 132 compared to the top entry that has ~10500 occurrences. Since the objective of this analysis is to identify the topmost weather events we can make a practical assumption that the other entries will have minimal impact on the ranking.
A quick look at Property and Crop Damage data shows that there is an exponent factor. We presume here that k=K for thousand, m=M for million, h=H for hundred, b=B for billion and other entries =0. So when we see a CROPDMG =5 and CROPDMGEXP =h we assume that this is equal to 5 hundred or 5* 10^2. To get the actual value we convert the exponent value (CROPDMGEXP/PROPDMGEXP) to their corresponding numbers.(i.e. h=2, k=3, b=9 etc..) We raise ten to this number then multiply to the CROPDMG/PROPDMG to get the actual value. We add the property and crop damage to get the total value of the damage.
levels(data1$PROPDMGEXP)
## [1] "" "+" "-" "0" "1" "2" "3" "4" "5" "6" "7" "8" "?" "B" "H" "K" "M"
## [18] "h" "m"
levels(data1$CROPDMGEXP)
## [1] "" "0" "2" "?" "B" "K" "M" "k" "m"
data1$CROPDMGEXP <- tolower(data1$CROPDMGEXP)
data1$PROPDMGEXP <- tolower(data1$PROPDMGEXP)
k3 <- function(x) gsub("k", "3", x)
h2 <- function(x) gsub("h", "2", x)
b9 <- function(x) gsub("b", "9", x)
m6 <- function(x) gsub("m", "6", x)
o0 <- function(x) gsub("[[:punct:]]", "0", x)
data1<-mutate(data1, CROPDMGEXP = h2(CROPDMGEXP), CROPDMGEXP = k3(CROPDMGEXP),
CROPDMGEXP = m6(CROPDMGEXP), CROPDMGEXP = b9(CROPDMGEXP),
CROPDMGEXP = o0(CROPDMGEXP),
PROPDMGEXP = h2(PROPDMGEXP), PROPDMGEXP = k3(PROPDMGEXP),
PROPDMGEXP = m6(PROPDMGEXP), PROPDMGEXP = b9(PROPDMGEXP),
PROPDMGEXP = o0(PROPDMGEXP),
PROPDMGEXP = as.numeric(PROPDMGEXP),CROPDMGEXP = as.numeric(CROPDMGEXP))
data2<-mutate(data1, CROPDMG2 = CROPDMG * (10^CROPDMGEXP),
PROPDMG2 = PROPDMG * (10^PROPDMGEXP))
data2$CROPDMG2[is.na(data2$CROPDMG2)]<-0
data2$PROPDMG2[is.na(data2$PROPDMG2)]<-0
data2<-mutate(data2,TOTDMG=CROPDMG2+PROPDMG2)
To determine the event type that has the most impact on human health we filter our data to entries that resulted to fatality or an injury. We then sort the data to top 10 rankings.
datahealth<- data2[data2$FATALITIES > 0 | data2$INJURIES > 0,]
dim(datahealth)
## [1] 12764 11
datahealth <-select(datahealth,EVTYPE,INJURIES,FATALITIES)
datahealth <-mutate(datahealth,TOTAL=INJURIES+FATALITIES)
datahealthsorted <- aggregate(FATALITIES ~ EVTYPE, data = datahealth, sum)
datahealthsorted <- datahealthsorted[order(-datahealthsorted$FATALITIES), ][1:10, ]
datahealthsorted$EVTYPE <- factor(datahealthsorted$EVTYPE, levels = datahealthsorted$EVTYPE)
datahealthsorted2 <- aggregate(INJURIES ~ EVTYPE, data = datahealth, sum)
datahealthsorted2 <- datahealthsorted2[order(-datahealthsorted2$INJURIES), ][1:10, ]
datahealthsorted2$EVTYPE <- factor(datahealthsorted2$EVTYPE, levels = datahealthsorted2$EVTYPE)
To determine the event type that has the most economic impact on we filter our data to entries that have total damage (property and crop damage) greater than zero.We then sort the data to top 10 rankings.
datafin<- data2[data2$TOTDMG > 0,]
datafinsorted <- aggregate(TOTDMG ~ EVTYPE, data = datafin, sum)
datafinsorted <- datafinsorted[order(-datafinsorted$TOTDMG), ][1:10, ]
datafinsorted$EVTYPE <- factor(datafinsorted$EVTYPE, levels = datafinsorted$EVTYPE)
Based on Figure 1, excessive heat is the event that yields the highest number of fatalities with a value of 1797.
ggplot(datahealthsorted, aes(x = EVTYPE, y = FATALITIES)) +
geom_bar(stat = "identity", fill = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("EVENT TYPE") + ylab("FATALITIES") +
ggtitle("Figure 1: No. of Fatalities by Top 10 Weather Events")
Based on Figure 2, tornado is the event that yields the highest number of injuries with a value of 20667.
ggplot(datahealthsorted2, aes(x = EVTYPE, y = INJURIES)) +
geom_bar(stat = "identity", fill = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("EVENT TYPE") + ylab("INJURIES") +
ggtitle("Figure 2: No. of Injuries by Top 10 Weather Events")
A quick at the table for total incidents (fatalities and injuries) shows that tornado is top event with a total of 22178 incidents.
datahealthsorted3 <- aggregate(TOTAL ~ EVTYPE, data = datahealth, sum)
datahealthsorted3 <- datahealthsorted3[order(-datahealthsorted3$TOTAL), ][1:10, ]
head(datahealthsorted3,10)
## EVTYPE TOTAL
## 75 tornado 22178
## 19 excessive heat 8188
## 24 flood 7279
## 72 thunderstorm wind 5505
## 49 lightning 4792
## 23 flash flood 2561
## 83 wildfire 1543
## 84 winter storm 1483
## 31 heat 1459
## 41 hurricane/typhoon 1451
Based on Figure 3, flood is the event that yields the greatest damage with value close to 150 billion dollars.
ggplot(datafinsorted, aes(x = EVTYPE, y = TOTDMG)) +
geom_bar(stat = "identity", fill = "blue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("EVENT TYPE") + ylab("TOTAL DAMAGE") +
ggtitle("Figure 3: Cost of Total Damage by Top 10 Weather Events")
Based on the tables below drought is the weather event that brings the most crop damage with a value of ~ 13 billion dollars. For property damage only, flood is the top event ~140 billion dollars. It should be noted that damage in property value for the top event is more than 10x than worth of the top event in crop damage. It is therefore not surprising that the driver for the total damage rankings are coming from property damage list.
datafinsorted2 <- aggregate(CROPDMG2 ~ EVTYPE, data = datafin, sum)
datafinsorted2 <- datafinsorted2[order(-datafinsorted2$CROPDMG2), ][1:10, ]
head(datafinsorted2,10)
## EVTYPE CROPDMG2
## 20 drought 13367566000
## 50 hurricane/typhoon 5350107800
## 30 flood 4983266500
## 39 hail 2496822450
## 28 flash flood 1334901700
## 14 extreme cold/wind chill 1326623000
## 32 frost/freeze 1250911000
## 83 thunderstorm wind 1016942600
## 42 heavy rain 728169800
## 94 tropical storm 677711000
datafinsorted3 <- aggregate(PROPDMG2 ~ EVTYPE, data = datafin, sum)
datafinsorted3 <- datafinsorted3[order(-datafinsorted3$PROPDMG2), ][1:10, ]
head(datafinsorted3,10)
## EVTYPE PROPDMG2
## 30 flood 144003143200
## 50 hurricane/typhoon 81718889010
## 80 storm surge 43193536000
## 92 tornado 24616945710
## 28 flash flood 15222253910
## 39 hail 14595213420
## 83 thunderstorm wind 7913420880
## 104 wildfire 7760449500
## 94 tropical storm 7642475550
## 35 high wind 5249996360