The impact of Severe weather events in the United States

by Aki Taniguchi, 04/11/2019

I. Synopsis

Using the National Oceanic and Atmospheric Administration (NOAA) Storm Event Database, we will try to identify which type of events has been the most harmful, first in terms of population health, and second from an economic perspective. In order to do so, we will first process the dataset and keep only meaningful variables and observations. We will be using dates between 1996 and 2011 only as we will see that these are the most meaningful data ranges for the exercuse. Second, we will create 2 new variables: Human Damange and Economic damage, expressed as the sum of Fatalities and Injuries, and as the sum of Crop Damage and Property Damage (expressed in 1 US Dollar) respectively. Finally, we will plot each impact (both human and economic) against the different types of event. It will become clear to us “Tornado” events impact the most human health, whereas “Hurricane” damges the most the economy. Please note that everything has been coded and processed using R for Reproducibility sake, and sources (mostly URL) will be listed at the end of the analysis.

II. Data Processing

II. A. Loading data

Let’s load the data first (the path can be changed at user’s convenience). The URL to download the raw file can be found here.

setwd("C:/Users/tngch/Documents/R/Learning/5. Reproducible Research")
nameFile = "repdata_data_StormData.csv.bz2"
df = read.csv(nameFile)
dim(df)

## [1] 902297     37

dfSize = nrow(df) * ncol(df) * 8 / 2^20
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "Mb."))

## [1] "The dataframe size is 254.71 Mb."

Given the dimension of the datasets, it seems necessary to analyze the relevance of each observation and variable that will help us in our analysis. Reducing the size of our datasets will greatly enhance the processing speed of the exercise.

colnames(df)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

head(df)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

str(df)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

II. B. Subsetting rows

First, let us look at how many observations are done per year:

df$Year <- as.numeric(format(as.Date(df$BGN_DATE, "%m/%d/%Y"), "%Y"))
with(df, hist(Year, breaks = max(Year) - min(Year), main = "Number of observation per year", xlab = "Year", ylab = "Frequency"))

One bar in the histogram represents 1 year. We can clearly see that there is a huge jump in number of observation starting from 1996. As it can be inferred from the plot in NOAA’ s website, there has been a change in regulation (probably the implementation of Directive 10-1605), the NOAA started to increase its observation to 48 events. We will therefore subset data which starts from 1996 only as previous years might skew our analysis (we need to keep an apple-to-apple comparison).

df = df[df$Year == 1996,]

II. C. Subsetting columns

Looking at the variables, it seems that we only need:
1. Types of events: EVTYPE
2. Population health related variables: FATALITIES and INJURIES
3. Economic consequences related variables: PROPDMG and CROPDMG, and their respective exponent PROPDMGEXP and CROPDMGEXP.

df = df[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
dim(df)

## [1] 32270     7

dfSize = nrow(df) * ncol(df) * 8 / 2^20
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "Mb."))

## [1] "The dataframe size is 1.72 Mb."

We can see that the dataset size is greatly reduced.

Let’s now see whether the data looks good to be processed.

str(df)

## 'data.frame':    32270 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 972 834 856 856 856 244 359 856 856 856 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  380 100 3 5 2 0 400 12 8 12 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 1 17 17 17 17 ...
##  $ CROPDMG   : num  38 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 1 1 1 1 1 1 1 1 1 ...

sum(is.na(df))

## [1] 0

length(unique(df$EVTYPE))

## [1] 228

The good news is there is no missing values so no need to do anything special on that front. However, from the NOAA documentation we know that there are 48 types of events, but we can clearly see there are more event type in the dataset: this is due to typo. We will need to clean this variable.

II. D. Cleaning the dataset

We first quantify the impact on population health and the impact on the economy:
1/ We will define the impact on population health as the sum of Fatalities and Injuries for each event, e.g. the sum of FATALITIES and INJURIES per event types.
2/ We will define the economic damage as being the dollar impact (in US dollar unit) done to property and crops, e.g. the sum of PROPDMG and CROPDMG, modified to take into account their respective Exponent.

length(intersect(unique(toupper(df$PROPDMGEXP)), unique(toupper(df$CROPDMGEXP)))) == length(unique(toupper(df$CROPDMGEXP)))

## [1] TRUE

length(intersect(unique(toupper(df$PROPDMGEXP)), unique(toupper(df$CROPDMGEXP)))) == length(unique(toupper(df$PROPDMGEXP)))

## [1] TRUE

We can see that all values of CROPDMG EXP are included in PROPDMGEXP. In order to convert PROPDMG and CROPDMG, we need to convert their Exponent in the relevant unit size thanks to the table shown here, and then reisizing PROPDMG and CROPDMG to put them on the same basis (e.g. 1 USD).

unitType = c(unique(toupper(df$PROPDMGEXP))) 
unitValue = c(1000, 0, 1000000)
unitConversionDf = data.frame(unitType = unitType, unitValue = unitValue)
df$PROPDMGEXP = toupper(df$PROPDMGEXP)
df$CROPDMGEXP = toupper(df$CROPDMGEXP)
df = merge(df, unitConversionDf, by.x = "PROPDMGEXP", by.y = "unitType")
df = merge(df, unitConversionDf, by.x = "CROPDMGEXP", by.y = "unitType")
df$ECONOMIC.DAMAGE = df$PROPDMG * df$unitValue.x + df$CROPDMG * df$unitValue.y

df$HUMAN.DAMAGE = df$FATALITIES + df$INJURIES

df = df[, c("EVTYPE", "HUMAN.DAMAGE", "ECONOMIC.DAMAGE")]
df = aggregate(list(df$HUMAN.DAMAGE, df$ECONOMIC.DAMAGE), by = list(df$EVTYPE), sum)
colnames(df) = c("EVTS", "TOTAL.HUMAN.DAMAGE", "TOTAL.ECONOMIC.DAMAGE")
df = df[df$TOTAL.HUMAN.DAMAGE != 0 & df$TOTAL.ECONOMIC.DAMAGE != 0,]

dfSize = nrow(df) * ncol(df) * 8
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "b. The number of column is", ncol(df)))

## [1] "The dataframe size is 912.00 b. The number of column is 3"

We now have a dataframe which expresses Human Damage and Economic Damage in function of Events. We can see the dataset has been drastically reduced, so we can now clean Event types following the original 48 defined by NOAA (given the size of the data is now a lot smaller, we will do it by hand).

df$EVTS = toupper(df$EVTS)

## Coastal flood / Coastal storm grouping
df[c(2, 3), "EVTS"]

## [1] "COASTAL FLOODING" "COASTAL STORM"

df$TOTAL.HUMAN.DAMAGE[2] = df$TOTAL.HUMAN.DAMAGE[2] + df$TOTAL.HUMAN.DAMAGE[3]
df$TOTAL.ECONOMIC.DAMAGE[2] = df$TOTAL.ECONOMIC.DAMAGE[2] + df$TOTAL.ECONOMIC.DAMAGE[3]
df = df[-c(3),]

## Extreme cold / Extended cold / Extreme Windchill grouping
df[c(6, 7, 8, 9), "EVTS"]

## [1] "EXTENDED COLD"     "EXTREME COLD"      "EXTREME COLD"     
## [4] "EXTREME WINDCHILL"

df$TOTAL.HUMAN.DAMAGE[7] = df$TOTAL.HUMAN.DAMAGE[6] + df$TOTAL.HUMAN.DAMAGE[7] + df$TOTAL.HUMAN.DAMAGE[9] + df$TOTAL.HUMAN.DAMAGE[8]
df$TOTAL.ECONOMIC.DAMAGE[7] = df$TOTAL.ECONOMIC.DAMAGE[6] + df$TOTAL.ECONOMIC.DAMAGE[7] + df$TOTAL.ECONOMIC.DAMAGE[9] + df$TOTAL.HUMAN.DAMAGE[8]
df = df[-c(6, 8, 9),]

## Heavy snow grouping
df[c(12, 13, 23, 24), "EVTS"]

## [1] "HEAVY SNOW"        "HEAVY SNOW SHOWER" "SNOW"             
## [4] "SNOW"

df$TOTAL.HUMAN.DAMAGE[12] = df$TOTAL.HUMAN.DAMAGE[12] + df$TOTAL.HUMAN.DAMAGE[13] + df$TOTAL.HUMAN.DAMAGE[24] + df$TOTAL.HUMAN.DAMAGE[23]
df$TOTAL.ECONOMIC.DAMAGE[12] = df$TOTAL.ECONOMIC.DAMAGE[12] + df$TOTAL.ECONOMIC.DAMAGE[13] + df$TOTAL.ECONOMIC.DAMAGE[24] + df$TOTAL.ECONOMIC.DAMAGE[23]
df = df[-c(13, 23, 24),]

## Heavy surf / High surf grouping
df[c(13, 14, 21), "EVTS"]

## [1] "HEAVY SURF" "HIGH SURF"  "ROUGH SURF"

df$TOTAL.HUMAN.DAMAGE[14] = df$TOTAL.HUMAN.DAMAGE[14] + df$TOTAL.HUMAN.DAMAGE[13] + df$TOTAL.HUMAN.DAMAGE[21]
df$TOTAL.ECONOMIC.DAMAGE[14] = df$TOTAL.ECONOMIC.DAMAGE[14] + df$TOTAL.ECONOMIC.DAMAGE[13] + df$TOTAL.ECONOMIC.DAMAGE[21]
df = df[-c(13, 21),]

## Strong wind grouping
df[c(20, 21), "EVTS"]

## [1] "STRONG WIND"  "STRONG WINDS"

df$TOTAL.HUMAN.DAMAGE[20] = df$TOTAL.HUMAN.DAMAGE[21] + df$TOTAL.HUMAN.DAMAGE[20]
df$TOTAL.ECONOMIC.DAMAGE[20] = df$TOTAL.ECONOMIC.DAMAGE[21] + df$TOTAL.ECONOMIC.DAMAGE[20]
df = df[-c(21),]

## Thunderstorm grouping (TSTM)
df[c(23, 24), "EVTS"]

## [1] "TSTM WIND"      "TSTM WIND/HAIL"

df$TOTAL.HUMAN.DAMAGE[23] = df$TOTAL.HUMAN.DAMAGE[24] + df$TOTAL.HUMAN.DAMAGE[23]
df$TOTAL.ECONOMIC.DAMAGE[23] = df$TOTAL.ECONOMIC.DAMAGE[24] + df$TOTAL.ECONOMIC.DAMAGE[23]
df = df[-c(24),]

The data is now fully processed:

dfSize = nrow(df) * ncol(df) * 8
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "b. The number of column is", ncol(df), "and the number of row is", nrow(df)))

## [1] "The dataframe size is 648.00 b. The number of column is 3 and the number of row is 27"

str(df)

## 'data.frame':    27 obs. of  3 variables:
##  $ EVTS                 : chr  "BLIZZARD" "COASTAL FLOODING" "DRY MICROBURST" "DUST DEVIL" ...
##  $ TOTAL.HUMAN.DAMAGE   : num  193 3 1 1 28 88 136 51 44 83 ...
##  $ TOTAL.ECONOMIC.DAMAGE: num  40755000 6375000 277000 8300 30000 ...

We are now ready to get the results of our analysis.

III. Results and Conclusions

Let’s plot the 2 graphs now that we have a nice and clean dataset.

library(ggplot2)
library(scales)

ggplot(data = df, aes(x = EVTS, y = TOTAL.HUMAN.DAMAGE)) + geom_bar(stat="identity") + theme_classic() + ggtitle("Number of Death/Injuries per Weather events") + xlab("Events Type") + ylab("Sum of Fatalities and Injuries") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(face="bold", angle=90))

ggplot(data = df, aes(x = EVTS, y = TOTAL.ECONOMIC.DAMAGE)) + geom_bar(stat="identity") + theme_classic() + ggtitle("Economic Impact per Weather events (in USD)") + xlab("Events Type") + ylab("Amount of Economic Impact in USD") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(face="bold", angle=90)) + scale_y_continuous(labels = comma)

paste(c("Total of deaths and injuries from 1996 to 2011:",df[which.max(df$TOTAL.HUMAN.DAMAGE), "TOTAL.HUMAN.DAMAGE"]))

## [1] "Total of deaths and injuries from 1996 to 2011:"
## [2] "731"

paste(c("Total economic impact:",format(df[which.max(df$TOTAL.ECONOMIC.DAMAGE), "TOTAL.ECONOMIC.DAMAGE"], big.mark = ","),"USD"))

## [1] "Total economic impact:" "1,732,035,000"         
## [3] "USD"

Using these 2 plots, it is clear that the event which has the most important impact on population health from 1996 to 2011 is Tornado. This represents 731 deaths and injuries combined.

Additionnally, we can also see that the event which has the most important economic impact from 1996 to 2011 is Hurricane. The impact amounts to $1.7bn.

IV. Sources

1/ Storm events Dataset: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
2/ Evolution of the dataset in NOAA’s website: https://www.ncdc.noaa.gov/stormevents/details.jsp
3/ NOAA documentation on variables: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
4/ Exponential conversion explanation: https://github.com/flyingdisc/RepData_PeerAssessment2/blob/master/how-to-handle-PROPDMGEXP.md

OS: Windows 10 Home
R.version: 3.6.1