Harmful Weather Events and their Impact on Economic Activity

Synopsis

In this report, we identify the severe weather events that are most harmful to population health, as well as the events that have the greatest economic consequences. To identify these events, we draw on data from the NOAA Storm Database, which tracks characteristics of major storms and weather events in the United States. From our analysis, we find that tornadoes are the most harmful with respect to population health, having killed over 5,000 people and injured more than 90,000, while floods have the greatest economic impact, causing more than $100 billion in property and crop damage in the US.

Data Processing

From the National Oceanic & Atmospheric Administration, we obtained the data on severe weather events.

We start by setting our working directory and importing the key libraries, which help us conduct our data analysis more efficiently.

library(lattice)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

We first read only the first 500 rows of the dataset to obtain the class of each column, since supplying the column classes up front speeds up reading the full file. We store the classes in a vector and pass it to read.table through the colClasses argument when reading the full NOAA dataset.

noaa500 <- read.table("repdata-data-StormData.csv", sep = ",", nrows = 500)
colclass <- c()

for (i in 1:dim(noaa500)[2]){
        colclass <- c(colclass, class(noaa500[2, i]))
}

noaa <- read.table("repdata-data-StormData.csv", sep = ",", colClasses = colclass, header = T, 
                   comment.char = "#")
dim(noaa)
## [1] 902297     37
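
As a hedged illustration of the sampling technique described above, the sketch below infers column classes from a couple of rows and reuses them for the full read. It works on a small in-memory CSV (the csv_text contents and the names sample_rows, col_classes and full are made up for illustration) rather than the NOAA file. The sample read sets header = TRUE; without it, the header row is treated as data and every column is inferred as text.

```r
# Toy CSV standing in for the storm data (values are illustrative only).
csv_text <- "STATE,EVTYPE,INJURIES
AL,TORNADO,15
AL,TORNADO,2
TX,HAIL,0
TX,FLOOD,1"

# Read a small sample, parsing the first row as column names.
sample_rows <- read.table(text = csv_text, sep = ",", header = TRUE,
                          nrows = 2, stringsAsFactors = FALSE)

# One class per column, taken from the sample.
col_classes <- unname(vapply(sample_rows, function(x) class(x)[1], character(1)))

# Reuse the inferred classes for the full read.
full <- read.table(text = csv_text, sep = ",", header = TRUE,
                   colClasses = col_classes)
str(full)
```

With classes inferred this way, numeric columns arrive as numbers and need no later conversion from text.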

After reading the dataset, we check the first few rows, noting that there are 902297 rows in this dataset.

head(noaa[, 1:10])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1    1.00  4/18/1950 0:00:00     0130       CST  97.00     MOBILE    AL
## 2    1.00  4/18/1950 0:00:00     0145       CST   3.00    BALDWIN    AL
## 3    1.00  2/20/1951 0:00:00     1600       CST  57.00    FAYETTE    AL
## 4    1.00   6/8/1951 0:00:00     0900       CST  89.00    MADISON    AL
## 5    1.00 11/15/1951 0:00:00     1500       CST  43.00    CULLMAN    AL
## 6    1.00 11/15/1951 0:00:00     2000       CST  77.00 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI
## 1 TORNADO      0.00        
## 2 TORNADO      0.00        
## 3 TORNADO      0.00        
## 4 TORNADO      0.00        
## 5 TORNADO      0.00        
## 6 TORNADO      0.00

We extract the column of interest (in this case, the Event Type specified by EVTYPE), and print a brief summary.

evtype <- noaa$EVTYPE
summary(evtype)[1:10]; length(unique(evtype))
##               HAIL          TSTM WIND  THUNDERSTORM WIND 
##             288661             219940              82563 
##            TORNADO        FLASH FLOOD              FLOOD 
##              60652              54277              25326 
## THUNDERSTORM WINDS          HIGH WIND          LIGHTNING 
##              20843              20212              15754 
##         HEAVY SNOW 
##              15708
## [1] 985

We check if there are any missing values in the dataset.

sum(is.na(evtype))
## [1] 0

As we are interested in finding out which types of events are most harmful with respect to population health, and which have the greatest economic consequences, we first filter the data frame to include only the columns we are interested in, and convert those columns to the appropriate formats. We start by creating two separate data frames for Injuries and Fatalities:

Injuries

df_inj <- select(noaa, EVTYPE, INJURIES) 
df_inj$INJURIES <- as.numeric(as.character(df_inj$INJURIES))
df_inj <- filter(df_inj, INJURIES > 0)
head(df_inj)
##    EVTYPE INJURIES
## 1 TORNADO       15
## 2 TORNADO        2
## 3 TORNADO        2
## 4 TORNADO        2
## 5 TORNADO        6
## 6 TORNADO        1

Fatalities

df_fat <- select(noaa, EVTYPE, FATALITIES)
df_fat$FATALITIES <- as.numeric(as.character(df_fat$FATALITIES))
df_fat <- filter(df_fat, FATALITIES > 0)
head(df_fat)
##    EVTYPE FATALITIES
## 1 TORNADO          1
## 2 TORNADO          1
## 3 TORNADO          4
## 4 TORNADO          1
## 5 TORNADO          6
## 6 TORNADO          7
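
The conversions above go through as.character before as.numeric. Assuming the columns were read as factors (which the per-level counts printed by summary(evtype) suggest), the following minimal sketch on made-up values shows why the intermediate step matters:

```r
# A factor whose labels look like numbers (illustrative values).
f <- factor(c("15", "2", "2", "6"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

The level codes depend only on the sorted order of the labels, so using them directly would silently produce wrong totals.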

Creating a separate dataframe for Property and Crop Damage

df_dmg <- select(noaa, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
df_dmg$PROPDMG <- as.numeric(as.character(df_dmg$PROPDMG))
df_dmg$CROPDMG <- as.numeric(as.character(df_dmg$CROPDMG))

# Converting to lowercase
df_dmg$PROPDMGEXP <- tolower(df_dmg$PROPDMGEXP)
df_dmg$CROPDMGEXP <- tolower(df_dmg$CROPDMGEXP)

# We look at "b" and "m" as they stand for billions and millions - allowing us to take a more focused approach to identifying high-profile (in terms of economic impact) event types
df_dmg <- filter(df_dmg, CROPDMGEXP %in% c("b", "m") | PROPDMGEXP %in% c("b", "m"))
head(df_dmg)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO     2.5          m       0           
## 2 TORNADO     2.5          m       0           
## 3 TORNADO     2.5          m       0           
## 4 TORNADO     2.5          m       0           
## 5 TORNADO     2.5          m       0           
## 6 TORNADO     2.5          m       0
dim(df_dmg)
## [1] 12693     5

Noting that there should be only 48 unique event types, whereas the data contains 985 distinct EVTYPE values, we attempt to fix some of the inconsistencies using functions such as tolower and trimws.

# Lowercase and trim whitespace so that variants of the same event type collapse together
df_inj$EVTYPE <- trimws(tolower(as.character(df_inj$EVTYPE)))
df_fat$EVTYPE <- trimws(tolower(as.character(df_fat$EVTYPE)))
df_dmg$EVTYPE <- trimws(tolower(as.character(df_dmg$EVTYPE)))
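
To illustrate what this normalisation does and does not achieve, here is a small sketch on made-up labels (raw and clean are hypothetical names, not part of the analysis):

```r
# Illustrative labels with inconsistent case and stray whitespace.
raw <- c("TORNADO", "Tornado", " TORNADO", "HAIL", "hail ", "Flash Flood")

# Lowercase first, then strip leading and trailing whitespace.
clean <- trimws(tolower(raw))

length(unique(raw))    # distinct labels before cleaning
length(unique(clean))  # distinct labels after cleaning
```

Case and whitespace variants collapse, but misspellings and synonyms (for example, a label spelled "avalance") would remain separate categories under this approach.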

As the data frame used to analyse the economic consequences of the various event types contains the exponent columns PROPDMGEXP and CROPDMGEXP, we need to carry out additional processing to obtain the total damage caused by each event. See here for more information on processing the data. We then add the Total Property Damage column to the Total Crop Damage column to obtain the total damage done, expressed in billions of dollars.

# To see the potential unique values which can arise
unique(union(unique(df_dmg$PROPDMGEXP), unique(df_dmg$CROPDMGEXP)))
## [1] "m" "b" "k" ""  "5" "0" "?"
# The values are "m", "b", "k", "", "5", "0" and "?"; we convert the damage figures accordingly
for (i in 1:dim(df_dmg)[1]){
        df_dmg$TOT_PROPDMG[i] <- 
                if (df_dmg$PROPDMGEXP[i] == "b"){
                        df_dmg$PROPDMG[i] * 1000000000
                } else if (df_dmg$PROPDMGEXP[i] == "m"){
                        df_dmg$PROPDMG[i] * 1000000
                } else if (df_dmg$PROPDMGEXP[i] == "k"){
                        df_dmg$PROPDMG[i] * 1000
                } else if (df_dmg$PROPDMGEXP[i] %in% c("", "?")){ 
                        0
                } else {
                        df_dmg$PROPDMG[i] * 10 
                }
}

for (i in 1:dim(df_dmg)[1]){
        df_dmg$TOT_CROPDMG[i] <- 
                if (df_dmg$CROPDMGEXP[i] == "b"){
                        df_dmg$CROPDMG[i] * 1000000000
                } else if (df_dmg$CROPDMGEXP[i] == "m"){
                        df_dmg$CROPDMG[i] * 1000000
                } else if (df_dmg$CROPDMGEXP[i] == "k"){
                        df_dmg$CROPDMG[i] * 1000
                } else if (df_dmg$CROPDMGEXP[i] %in% c("", "?")){
                        0
                } else {
                        df_dmg$CROPDMG[i] * 10
                }
}

df_dmg$TOT_DMG <- df_dmg$TOT_PROPDMG + df_dmg$TOT_CROPDMG
df_dmg$TOT_DMG <- df_dmg$TOT_DMG/1000000000
df_dmg <- df_dmg[, c("EVTYPE", "TOT_DMG")]
head(df_dmg)
##    EVTYPE TOT_DMG
## 1 tornado  0.0025
## 2 tornado  0.0025
## 3 tornado  0.0025
## 4 tornado  0.0025
## 5 tornado  0.0025
## 6 tornado  0.0025
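
The row-by-row loops above can be slow on large data frames. As an alternative sketch (not the method used in this report), the same multiplier rules can be applied in a vectorised way. The helper name exp_multiplier and the toy data frame are hypothetical, and the helper assumes the exponent column is a character vector, as it is after tolower.

```r
# Multipliers mirroring the rules in the loops above: "b", "m" and "k" scale
# the value; "" and "?" contribute nothing; any other code multiplies by 10.
exp_multiplier <- function(exp_code) {
  known <- c(b = 1e9, m = 1e6, k = 1e3)
  mult <- unname(known[exp_code])       # NA where the code is not b/m/k
  mult[is.na(mult)] <- 10               # fallback for other codes such as "0" or "5"
  mult[exp_code %in% c("", "?")] <- 0   # blank or unknown exponent
  mult
}

# Toy data (illustrative values only).
toy <- data.frame(PROPDMG = c(2.5, 1, 3, 4),
                  PROPDMGEXP = c("m", "b", "", "5"),
                  stringsAsFactors = FALSE)

toy$TOT_PROPDMG <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Using one helper for both the property and the crop columns also removes the risk of the two loops drifting out of sync.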

Results

We apply the aggregate function to the three datasets to group them by event type. Since we are interested in the total effect each event type has on population health and on the economy, we pass in the sum function.

Injuries

df_inj <- aggregate(INJURIES ~ EVTYPE, data = df_inj, sum, na.rm = T)
head(df_inj)
##          EVTYPE INJURIES
## 1     avalanche      170
## 2     black ice       24
## 3      blizzard      805
## 4  blowing snow       14
## 5    brush fire        2
## 6 coastal flood        2

Fatalities

df_fat <- aggregate(FATALITIES ~ EVTYPE, data = df_fat, sum, na.rm = T)
head(df_fat)
##          EVTYPE FATALITIES
## 1      avalance          1
## 2     avalanche        224
## 3     black ice          1
## 4      blizzard        101
## 5  blowing snow          2
## 6 coastal flood          3

Economic Consequence

df_dmg <- aggregate(TOT_DMG ~ EVTYPE, data = df_dmg, sum, na.rm = T)
head(df_dmg)
##                      EVTYPE TOT_DMG
## 1       agricultural freeze 0.02882
## 2    astronomical high tide 0.00850
## 3                 avalanche 0.00210
## 4                  blizzard 0.74703
## 5 coastal  flooding/erosion 0.01500
## 6             coastal flood 0.24628
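
For readers unfamiliar with the formula interface of aggregate, the following minimal sketch on made-up data shows how the grouping and summing used above behave (toy and totals are illustrative names):

```r
# Toy per-event records (illustrative values only).
toy <- data.frame(
  EVTYPE = c("tornado", "flood", "tornado", "hail", "flood"),
  INJURIES = c(15, 2, 6, 1, 3)
)

# Sum INJURIES within each EVTYPE; the result has one row per event type,
# sorted by the grouping variable.
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```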

Next, for each of the three datasets we compute a threshold value, namely the fifth-largest total, so that we can focus on the top 5 event types by impact:

thr_inj <- sort(df_inj$INJURIES, decreasing = T)[5]
thr_fat <- sort(df_fat$FATALITIES, decreasing = T)[5]
thr_dmg <- sort(df_dmg$TOT_DMG, decreasing = T)[5]

The threshold values for injuries, fatalities and economic consequences are 5230, 816 and 17.5620265 respectively.

We filter the datasets based on these threshold values:

df_inj <- filter(df_inj, INJURIES >= thr_inj)
df_fat <- filter(df_fat, FATALITIES >= thr_fat)
df_dmg <- filter(df_dmg, TOT_DMG >= thr_dmg)
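
The threshold approach can be compared with simply ordering the rows and taking the first five. The sketch below uses made-up totals (all names and values are illustrative) and base R only. Note that a threshold filter can return more than five rows if several event types tie at the fifth-largest value, whereas head always returns exactly five.

```r
# Toy aggregated totals (illustrative values only).
totals <- data.frame(
  EVTYPE = c("a", "b", "c", "d", "e", "f", "g"),
  INJURIES = c(10, 50, 30, 50, 5, 20, 40),
  stringsAsFactors = FALSE
)

# Threshold approach: keep every row at or above the fifth-largest value.
thr <- sort(totals$INJURIES, decreasing = TRUE)[5]
top_thr <- totals[totals$INJURIES >= thr, ]

# Ordering approach: sort descending and take the first five rows.
top_ord <- head(totals[order(totals$INJURIES, decreasing = TRUE), ], 5)

print(thr)
print(top_ord)
```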

Finally, we use plots to identify the top 5 event types:

Injuries

g_inj <- ggplot(df_inj, aes(x = EVTYPE, y = INJURIES)) + theme_bw() + geom_bar(stat = 'identity') + 
        xlab('Event Type') + ylab('Total Injuries') + 
        ggtitle('Total Injuries by Event Type')
g_inj

Fatalities

g_fat <- ggplot(df_fat, aes(x = EVTYPE, y = FATALITIES)) + theme_bw() + geom_bar(stat = 'identity') + 
        xlab('Event Type') + ylab('Total Fatalities') + 
        ggtitle('Total Fatalities by Event Type')
g_fat

Economic Consequence

g_dmg <- ggplot(df_dmg, aes(x = EVTYPE, y = TOT_DMG)) + theme_bw() + geom_bar(stat = 'identity') + 
        xlab('Event Type') + ylab('Total Economic Damage (in $ Billions)') + 
        ggtitle('Total Economic Impact by Event Type')
g_dmg
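
Because EVTYPE is a character column, ggplot2 places the bars in alphabetical order. One possible refinement, sketched here on made-up values rather than the report's results, is to turn EVTYPE into a factor whose levels are ordered by the plotted total, so that the largest bar appears first. The reorder function is part of base R's stats package; the data frame top is hypothetical.

```r
# Toy top-event totals (illustrative values only).
top <- data.frame(
  EVTYPE = c("flood", "heat", "tornado", "lightning"),
  INJURIES = c(30, 10, 90, 20),
  stringsAsFactors = FALSE
)

# Order the factor levels by decreasing INJURIES (negate to sort descending).
top$EVTYPE <- reorder(top$EVTYPE, -top$INJURIES)

levels(top$EVTYPE)
```

Passing the reordered column to aes(x = EVTYPE) in the plotting calls above would then draw the bars from largest to smallest.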

It turns out that tornadoes have had the worst impact on population health, killing over 5,000 people and injuring more than 90,000, while floods have had the greatest economic consequences, costing the US more than $100 billion.