Executive Summary

An analysis is done on the database of the U.S. National Oceanic and Atmospheric Administration (NOAA) to determine the top severe weather events that have the most impact in terms of fatalities, injuries and property damage. Although the data is from 1950 to 2011, only records from 1996 onwards were considered since the tracking of the 48 different weather event types were only observed during this period. The data was tidied by reclassifying the duplicate event type entries. The value for the property damages were calculated by conversion of the exponent variables. The topmost severe weather event that has the highest number fatalities is excessive heat (~1800). Tornado is the leading cause of total incidents: fatalities and injuries combined (~22,000). Flood is the top event that has the most economic impact with almost $150 billion of property damage.

Data Processing

Loading the data

dataFileName <- "./repdata-data-StormData.csv.bz2"
if (!file.exists(dataFileName)){
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url = url, destfile = dataFileName,method = "curl")
}

if(!exists("stormData")){
stormData <- read.csv("repdata-data-StormData.csv.bz2")
}

Loading the required libraries

library(dplyr)
library(lubridate)
library(ggplot2)

Initial filtering

dim(stormData)
## [1] 902297     37
colnames(stormData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

A quick review of the data shows that there are 902297 observation and 37 variables. Not all the avariables are needed for the analyis so we select only variables that provide information for the health and economic impact :BGN_DATE,EVTYPE,FATALITIES, INJURIES, PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP.

stormDataRed <- stormData[,c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

At this point we only select the entries that were recorded after 1995.

data1 <- mutate(stormDataRed, BGN_DATE=mdy_hms(BGN_DATE))
data1 <- filter(data1, year(BGN_DATE)> 1995)
dim(data1)
## [1] 653530      8

We only select the observations that resulted to any of the following : fatality, injury, crop damage or property damage

data1 <- data1[data1$FATALITIES > 0 | data1$INJURIES > 0 | data1$CROPDMG > 0 | data1$PROPDMG > 0,]
dim(data1)
## [1] 201318      8

We have narrowed down our dataset to 201318 observations.

Tidying entries in the event types

A quick look at the factors for the event types reveals 222 unique entries. This is far greater than the standard entries of 48 as indicated here Different weather types

We therefore need to tidy the entries for event type. By just transforming the event entries to lower case we have reduced the unique values from 222 to 186

length(unique(data1$EVTYPE))
## [1] 222
data1$EVTYPE <- tolower(data1$EVTYPE)
length(unique(data1$EVTYPE))
## [1] 186

A check of the top 50 most frequently occuring entries show duplication such as tstm wind and thunderstorm wind and rip current and rip currents.

data1$EVTYPE <- factor(data1$EVTYPE)
uniqueEvtype <- sort(table(data1$EVTYPE), decreasing=TRUE)
uniqueEvtype[1:50]
## 
##               tstm wind       thunderstorm wind                    hail 
##                   61776                   43097                   22679 
##             flash flood                 tornado               lightning 
##                   19011                   12366                   11152 
##                   flood               high wind             strong wind 
##                    9513                    5402                    3369 
##            winter storm              heavy rain              heavy snow 
##                    1460                    1047                    1029 
##                wildfire    urban/sml stream fld          excessive heat 
##                     847                     702                     685 
##               ice storm          tstm wind/hail          tropical storm 
##                     631                     441                     410 
##          winter weather        wild/forest fire             rip current 
##                     405                     381                     364 
##               avalanche                 drought            rip currents 
##                     264                     258                     239 
##                blizzard        lake-effect snow               landslide 
##                     228                     194                     190 
##             storm surge            extreme cold                    heat 
##                     169                     166                     164 
##           coastal flood              light snow      winter weather/mix 
##                     153                     141                     139 
##               high surf               hurricane            frost/freeze 
##                     128                     126                     117 
## extreme cold/wind chill        marine tstm wind                     fog 
##                     111                     109                     101 
##              dust storm         cold/wind chill              dust devil 
##                      96                      90                      84 
##             river flood          dry microburst       hurricane/typhoon 
##                      80                      75                      72 
##                    wind               dense fog    heavy surf/high surf 
##                      67                      58                      50 
##        storm surge/tide      marine strong wind 
##                      47                      46

The top 50 in terms ofmost frequently occurring event types were analyzed. Entries that seemed intuitively belonging to a standard event type entry are appropriately assigned. Others that could not be intuitively assigned are labeled as unclassified. White spaces were also removed.

tidytstm <- function(x) gsub("tstm", "thunderstorm", x)
removews <- function(x) gsub("^\\s+|\\s+$", "", x)
tidywinds <- function(x) {y <- gsub(" winds", " wind", x)
    gsub("winds", "wind", y)
}
tidyothers<-function(x) gsub("rip currents","rip current",x )
data1<-mutate(data1, EVTYPE=removews(EVTYPE), EVTYPE = tidytstm(EVTYPE), EVTYPE = tidywinds(EVTYPE), EVTYPE=tidyothers(EVTYPE))
data1$EVTYPE <- factor(data1$EVTYPE)
coastalflood <- c("coastal flood","coastal flooding")
densefog <- c("dense fog","fog")
fire <-c("wild/forest fire","wildfire")
flood <-c("urban/sml stream fld","fld")
frostfreeze <-c("frost","freeze")
hail <- c("small hail","hail")
heavysnow <-c("excessive snow","heavy snow","snow")
highwind <-c("gusty wind","high wind")
hurricane <-c("hurricane","typhoon","hurricane/typhoon")
surf <-c("heavy surf/high surf","heavy surf")
thunderstorm <- c("thunderstorm wind (g45)", "thunderstorm wind/hail","thunderstorm wind (g40)")
unclassified <-c("landslide","dry microburst","light snow","glaze","icy roads", "mixed precipitation","gradient wind","freezing drizzle","mudslide","coastal storm","freezing rain","light freezing rain","other","fog","river flood","wind")
windchill <-c("extreme windchill","cold/wind chill", "extreme cold/wind chill","cold","extreme cold")
winter <-c("winter weather/mix","winter weather")
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% thunderstorm] <- "thunderstorm wind"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% coastalflood] <- "coastal flood"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% flood] <- "flood"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% frostfreeze] <- "frost/freeze"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% hail] <- "hail"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% heavysnow] <- "heavy snow"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% fire] <- "wildfire"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% windchill] <- "extreme cold/wind chill"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% winter] <- "winter weather"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% hurricane] <- "hurricane/typhoon"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% surf] <- "high surf"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% highwind] <- "high wind"
levels(data1$EVTYPE)[levels(data1$EVTYPE) %in% unclassified] <- "unclassified"

unique.evtype <- sort(table(data1$EVTYPE), decreasing=TRUE)
unique.evtype[1:30]
## 
##        thunderstorm wind                     hail              flash flood 
##                   105363                    22690                    19012 
##                  tornado                lightning                    flood 
##                    12366                    11152                    10215 
##                high wind              strong wind             winter storm 
##                     5446                     3414                     1460 
##                 wildfire               heavy snow               heavy rain 
##                     1228                     1070                     1047 
##             unclassified           excessive heat                ice storm 
##                      792                      685                      631 
##              rip current           winter weather           tropical storm 
##                      603                      544                      410 
##  extreme cold/wind chill                avalanche                  drought 
##                      406                      264                      258 
##                 blizzard                high surf        hurricane/typhoon 
##                      228                      207                      207 
##         lake-effect snow            coastal flood              storm surge 
##                      194                      188                      169 
##                     heat marine thunderstorm wind             frost/freeze 
##                      164                      142                      132
length(unique.evtype)
## [1] 139

After the series of cleansing we still have 139 unique entries (still over the expected 48 standard entries). However the top 30th event entry only has a count of 132 compared to the top entry that has ~10500 occurrences. Since the objective of this analysis is to identify the topmost weather events we can make a practical assumption that the other entries will have minimal impact on the ranking.

Calculating actual property damage

A quick look at Property and Crop Damage data shows that there is an exponent factor. We presume here that k=K for thousand, m=M for million, h=H for hundred, b=B for billion and other entries =0. So when we see a CROPDMG =5 and CROPDMGEXP =h we assume that this is equal to 5 hundred or 5* 10^2. To get the actual value we convert the exponent value (CROPDMGEXP/PROPDMGEXP) to their corresponding numbers.(i.e. h=2, k=3, b=9 etc..) We raise ten to this number then multiply to the CROPDMG/PROPDMG to get the actual value. We add the property and crop damage to get the total value of the damage.

levels(data1$PROPDMGEXP)
##  [1] ""  "+" "-" "0" "1" "2" "3" "4" "5" "6" "7" "8" "?" "B" "H" "K" "M"
## [18] "h" "m"
levels(data1$CROPDMGEXP)
## [1] ""  "0" "2" "?" "B" "K" "M" "k" "m"
data1$CROPDMGEXP <- tolower(data1$CROPDMGEXP)
data1$PROPDMGEXP <- tolower(data1$PROPDMGEXP)

k3 <- function(x) gsub("k", "3", x)
h2 <- function(x) gsub("h", "2", x)
b9 <- function(x) gsub("b", "9", x)
m6 <- function(x) gsub("m", "6", x)
o0 <- function(x) gsub("[[:punct:]]", "0", x)

data1<-mutate(data1, CROPDMGEXP = h2(CROPDMGEXP), CROPDMGEXP = k3(CROPDMGEXP), 
           CROPDMGEXP = m6(CROPDMGEXP), CROPDMGEXP = b9(CROPDMGEXP), 
           CROPDMGEXP = o0(CROPDMGEXP),
           PROPDMGEXP = h2(PROPDMGEXP), PROPDMGEXP = k3(PROPDMGEXP), 
           PROPDMGEXP = m6(PROPDMGEXP), PROPDMGEXP = b9(PROPDMGEXP),
           PROPDMGEXP = o0(PROPDMGEXP), 
           PROPDMGEXP = as.numeric(PROPDMGEXP),CROPDMGEXP = as.numeric(CROPDMGEXP))

data2<-mutate(data1, CROPDMG2 = CROPDMG * (10^CROPDMGEXP),
             PROPDMG2 = PROPDMG * (10^PROPDMGEXP))
data2$CROPDMG2[is.na(data2$CROPDMG2)]<-0
data2$PROPDMG2[is.na(data2$PROPDMG2)]<-0
data2<-mutate(data2,TOTDMG=CROPDMG2+PROPDMG2)

Filtering and sorting of data

To determine the event type that has the most impact on human health we filter our data to entries that resulted to fatality or an injury. We then sort the data to top 10 rankings.

datahealth<- data2[data2$FATALITIES > 0 | data2$INJURIES > 0,]
dim(datahealth)
## [1] 12764    11
datahealth <-select(datahealth,EVTYPE,INJURIES,FATALITIES)
datahealth <-mutate(datahealth,TOTAL=INJURIES+FATALITIES)

datahealthsorted <- aggregate(FATALITIES ~ EVTYPE, data = datahealth, sum)
datahealthsorted <- datahealthsorted[order(-datahealthsorted$FATALITIES), ][1:10, ]
datahealthsorted$EVTYPE <- factor(datahealthsorted$EVTYPE, levels = datahealthsorted$EVTYPE)

datahealthsorted2 <- aggregate(INJURIES ~ EVTYPE, data = datahealth, sum)
datahealthsorted2 <- datahealthsorted2[order(-datahealthsorted2$INJURIES), ][1:10, ]
datahealthsorted2$EVTYPE <- factor(datahealthsorted2$EVTYPE, levels = datahealthsorted2$EVTYPE)

To determine the event type that has the most economic impact on we filter our data to entries that have total damage (property and crop damage) greater than zero.We then sort the data to top 10 rankings.

datafin<- data2[data2$TOTDMG > 0,]
datafinsorted <- aggregate(TOTDMG ~ EVTYPE, data = datafin, sum)
datafinsorted <- datafinsorted[order(-datafinsorted$TOTDMG), ][1:10, ]
datafinsorted$EVTYPE <- factor(datafinsorted$EVTYPE, levels = datafinsorted$EVTYPE)

Results

Health impact

Based on Figure 1, excessive heat is the event that yields the highest number of fatalities with a value of 1797.

ggplot(datahealthsorted, aes(x = EVTYPE, y = FATALITIES)) + 
  geom_bar(stat = "identity", fill = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("EVENT TYPE") + ylab("FATALITIES") +
  ggtitle("Figure 1: No. of Fatalities by Top 10 Weather Events")

Based on Figure 2, tornado is the event that yields the highest number of injuries with a value of 20667.

ggplot(datahealthsorted2, aes(x = EVTYPE, y = INJURIES)) + 
  geom_bar(stat = "identity", fill = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("EVENT TYPE") + ylab("INJURIES") +
  ggtitle("Figure 2: No. of Injuries by Top 10 Weather Events")

A quick at the table for total incidents (fatalities and injuries) shows that tornado is top event with a total of 22178 incidents.

datahealthsorted3 <- aggregate(TOTAL ~ EVTYPE, data = datahealth, sum)
datahealthsorted3 <- datahealthsorted3[order(-datahealthsorted3$TOTAL), ][1:10, ]
head(datahealthsorted3,10)
##               EVTYPE TOTAL
## 75           tornado 22178
## 19    excessive heat  8188
## 24             flood  7279
## 72 thunderstorm wind  5505
## 49         lightning  4792
## 23       flash flood  2561
## 83          wildfire  1543
## 84      winter storm  1483
## 31              heat  1459
## 41 hurricane/typhoon  1451

Economic impact

Based on Figure 3, flood is the event that yields the greatest damage with value close to 150 billion dollars.

ggplot(datafinsorted, aes(x = EVTYPE, y = TOTDMG)) + 
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("EVENT TYPE") + ylab("TOTAL DAMAGE") +
  ggtitle("Figure 3: Cost of Total Damage by Top 10 Weather Events")

Based on the tables below drought is the weather event that brings the most crop damage with a value of ~ 13 billion dollars. For property damage only, flood is the top event ~140 billion dollars. It should be noted that damage in property value for the top event is more than 10x than worth of the top event in crop damage. It is therefore not surprising that the driver for the total damage rankings are coming from property damage list.

datafinsorted2 <- aggregate(CROPDMG2 ~ EVTYPE, data = datafin, sum)
datafinsorted2 <- datafinsorted2[order(-datafinsorted2$CROPDMG2), ][1:10, ]
head(datafinsorted2,10)
##                     EVTYPE    CROPDMG2
## 20                 drought 13367566000
## 50       hurricane/typhoon  5350107800
## 30                   flood  4983266500
## 39                    hail  2496822450
## 28             flash flood  1334901700
## 14 extreme cold/wind chill  1326623000
## 32            frost/freeze  1250911000
## 83       thunderstorm wind  1016942600
## 42              heavy rain   728169800
## 94          tropical storm   677711000
datafinsorted3 <- aggregate(PROPDMG2 ~ EVTYPE, data = datafin, sum)
datafinsorted3 <- datafinsorted3[order(-datafinsorted3$PROPDMG2), ][1:10, ]
head(datafinsorted3,10)
##                EVTYPE     PROPDMG2
## 30              flood 144003143200
## 50  hurricane/typhoon  81718889010
## 80        storm surge  43193536000
## 92            tornado  24616945710
## 28        flash flood  15222253910
## 39               hail  14595213420
## 83  thunderstorm wind   7913420880
## 104          wildfire   7760449500
## 94     tropical storm   7642475550
## 35          high wind   5249996360

References

Different weather types

For queries you may contact me through my LinkedIn Account.