In this study we sought to explore the effects of severe weather on health and the economy in the US. To do this, we analyzed storm data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, covering 1950 to November 2011. The hypothesis was that certain types of events would cause more damage to property while others would be more damaging to health. We found some interesting relationships between the event types and the damage they caused over time. On a per-event basis, “moving” events such as tornadoes and floods cause more damage to property, while “still” events such as heat and cold cause more injuries. However, over time, seasonal events like floods and tornadoes have higher injury and death tolls, just as they cause more damage to property and crops than the other events.

library(dplyr)
library(data.table)
library(R.utils)

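# Strip leading and trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# Format a number with thousands separators and two decimal places
fmt <- function(x) {
        format(x, decimal.mark=".", big.mark=",", small.interval=3, nsmall=2, scientific = FALSE)
    }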

Research Question

The purpose of the study is to identify which types of events, among events like avalanches, fogs, and extreme cold, have the highest impact on health and the economy in the US. The hypothesis is that events with motion, e.g. storms, have a higher impact on property than those without, e.g. extreme fog, while the latter might have a significant impact on health. To investigate this, we’ll use data from NOAA’s storm database to analyse possible correlations between economic damage, injuries and fatalities caused by each type of event. Our two main questions are:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The data was downloaded in bz2 format, decompressed into a CSV, and loaded into a data variable.

dataFile <- "repdata-data-StormData.csv.bz2"
decomFile <-"repdata-data-StormData.csv"
if (!dataFile %in% dir("./")) {
      download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = dataFile)
}

if (!decomFile %in% dir("./")) {
     bunzip2(dataFile, decomFile, remove = FALSE, skip = TRUE)
}

data <- dplyr::tbl_df(data.table::fread(decomFile))

Read 0.0% of 967216 rows
Read 24.8% of 967216 rows
Read 42.4% of 967216 rows
Read 54.8% of 967216 rows
Read 73.4% of 967216 rows
Read 83.7% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
Warning in data.table::fread(decomFile): Read less rows (902297) than were
allocated (967216). Run again with verbose=TRUE and please report.

Looking at the dimensions of the data set, there are 902297 entries and 37 columns.

dim(data)
[1] 902297     37

We observe the columns to identify the most relevant ones to answer our questions:

names(data)
 [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
 [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
[11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
[21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
[26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
[31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS"    "REFNUM"    

To measure health hazard, we will evaluate two fields, FATALITIES and INJURIES, representing the number of fatalities and injuries caused by the event, respectively. To measure economic impact, we will look at the fields PROPDMG and CROPDMG, which represent the estimated damage to property (private property such as structures, objects and vegetation, as well as public infrastructure) and to crops, respectively. These two variables have complementary variables, PROPDMGEXP and CROPDMGEXP, which describe the magnitude of the value as follows: “K” for thousands of dollars, “M” for millions, “B” for billions, and an empty string for none.

It’s worth noting that at this point it’s unknown whether the damage estimates are adjusted for inflation, even though the data spans 61 years.

Peeking at the event types, we find many inconsistencies. The values ought to represent specific natural phenomena that are locally uncommon, such as snow in near-tropical regions. However, there are entries with labels such as hvy rain and wnd, which are alternative spellings (misspellings or abbreviations) of the same event, and others such as summary july 23-24, which are presumably summary damage reports up to that point, though the year is not specified. There are also entries with leading and trailing spaces, which can likewise produce different categories for the same event type.

data.sub <- data %>% dplyr::select(STATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, EVTYPE)

We check the number of unique event types before cleaning them.

length(unique(data.sub$EVTYPE))
[1] 985

To clean the data, we first trim the values of the event type and convert them to lower case. Then, we remove the entries that represent summaries, identified by the word “summary” at the beginning.

data.sub$EVTYPE <- tolower(trim(data.sub$EVTYPE))
# Drop summary rows: keep only entries whose label does not start with "summary"
data.sub <- data.sub[!grepl("^summary", data.sub$EVTYPE), ]

m = dim(data.sub)[1]
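
As a small illustration of the cleaning rules, the sketch below applies the same normalisation to a handful of made-up labels. The vector raw is hypothetical and not taken from the data set; base R's trimws() stands in for the trim() helper defined earlier, and grepl() with an anchored pattern is one way to drop only the labels that begin with "summary".

```r
# Hypothetical labels, chosen only to illustrate the cleaning rules
raw <- c(" Flood", "FLOOD ", "flood", "Summary July 23-24", "Hvy Rain")

# Trim surrounding whitespace and convert to lower case
clean <- tolower(trimws(raw))

# Keep every label that does not start with "summary"
clean <- clean[!grepl("^summary", clean)]

unique(clean)
# [1] "flood"    "hvy rain"
```

The three spellings of "flood" collapse into one category, and the summary entry is removed.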

To decide whether it would be worth correcting misspelled event labels, we check the share of records that carry some of them:

sum(data.sub$EVTYPE=="wnd")/m
[1] 1.108283e-06
sum(data.sub$EVTYPE=="hvy rain")/m
[1] 2.216565e-06

These shares are negligible. Still, the misspelled entries may be outliers or carry considerable damage to health or property, so we correct them where identified.

data.sub$EVTYPE[data.sub$EVTYPE=="wnd"] = "wind"
data.sub$EVTYPE[data.sub$EVTYPE=="hvy rain"] = "heavy rain"

After doing that, the number of unique event types is reduced to the value of length(unique(data.sub$EVTYPE)). Before we proceed with the analysis, we convert the property and crop damage values to numeric dollar amounts by applying their exponents, to make comparison easier.

# Map an exponent code to its multiplier; unrecognized codes map to 0
pow <- function(x){
    if (x == ""){
        return(1)
    }else if (x == "K"){
        return (10^3)
    }else if (x == "M"){
        return (10^6)
    }else if (x == "B"){
        return (10^9)
    }else{
        return (0)
    }
}

data.summ <- data.sub %>% dplyr::mutate(PROPERTY.DAMAGE = PROPDMG*sapply(PROPDMGEXP, pow), CROP.DAMAGE = CROPDMG*sapply(CROPDMGEXP, pow))
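
Because sapply() calls pow() once per row, the conversion can be slow on roughly 900,000 records. A vectorized lookup is one possible alternative. The sketch below is an assumption-laden illustration rather than part of the original analysis: the names multipliers and pow_vec are hypothetical, and the mapping is meant to mirror pow() exactly, including case-sensitive matching and a multiplier of 0 for unrecognized codes.

```r
# Named lookup vector: exponent code -> multiplier (same mapping as pow())
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorized counterpart of pow()
pow_vec <- function(codes) {
    out <- unname(multipliers[codes])      # NA for codes not in the table
    out[is.na(out)] <- 0                   # unrecognized codes -> 0
    out[!is.na(codes) & codes == ""] <- 1  # empty string -> 1
    out
}

pow_vec(c("K", "M", "B", "", "?"))
# [1] 1e+03 1e+06 1e+09 1e+00 0e+00
```

With this helper, a damage value of 25 paired with the code "K" would become 25 * 1000 = 25,000 dollars, the same result pow() gives.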

Analysis

We first group our data points by event type.

data.gp <- data.summ %>% dplyr::group_by(EVTYPE) %>% 
    summarise(PROPERTY.DAMAGE=sum(PROPERTY.DAMAGE), CROP.DAMAGE=sum(CROP.DAMAGE), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES), N=n())

With the data summary, we can observe basic statistics on injuries and fatalities, and on economic damage. The deadliest single event had 583 fatalities, and the event with the highest injury count had 1,700 registered or estimated injuries. As for property damage, the mean is about USD 473,546.67 and the highest value is USD 115,000,000,000.00, while crop damage averaged USD 54,409.75, with a highest value of USD 5,000,000,000.00.
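
The code that produced these figures is not shown. Assuming they are per-record maxima and means over data.summ, one way to obtain such statistics is sketched below on a toy data frame; the object toy and all of its values are invented for illustration.

```r
# Toy stand-in for data.summ; all values are made up
toy <- data.frame(
    FATALITIES      = c(0, 2, 10),
    INJURIES        = c(1, 0, 50),
    PROPERTY.DAMAGE = c(1000, 250000, 5e6),
    CROP.DAMAGE     = c(0, 500, 2e4)
)

# Maximum and mean of each measure, one column per measure
stats <- sapply(toy, function(col) c(max = max(col), mean = mean(col)))
stats
```

The result is a matrix with rows max and mean, so a value such as stats["max", "FATALITIES"] can be passed to fmt() for display.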

With the data grouped, we can accumulate the economic damages. The first interactive visualization lets us explore the effects the different event types have on property and crop damage, and their level of fatalities.

library(plotly)

# Formula (~) syntax, as required by plotly 4.x
plot_ly(data.gp, x = ~CROP.DAMAGE, y = ~PROPERTY.DAMAGE, text = ~paste("EventType: ", EVTYPE),
        type = "scatter", mode = "markers", color = ~FATALITIES) %>% layout( title = "Total Crop vs Property Damage", hovermode="closest" )

At first glance, we can see that most events are clustered at the bottom left of the chart, which indicates that they cause similar, comparatively low levels of damage to crops and property. Some events, such as storm surges and heavy rain, have little or no registered impact on crops, while severe thunderstorms and wind chills don’t seem to have an impact on property.

Events by Crop Damage

head(arrange(data.gp, desc(CROP.DAMAGE)))
Source: local data frame [6 x 6]

       EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES      N
        (chr)           (dbl)       (dbl)      (dbl)    (dbl)  (int)
1     drought      1046106000 13972566000          0        4   2488
2       flood    144657709807  5661968450        470     6789  25327
3 river flood      5118945500  5029459000          2        2    173
4   ice storm      3944927810  5022113500         89     1975   2006
5        hail     15727366777  3025537453         15     1361 288661
6   hurricane     11868319010  2741910000         61       46    174

Floods are more damaging to property than any other event and stand out as an outlier in that respect. They also rank second in damage to crops, where droughts lead. The two leading totals differ by a factor of about 10, with floods causing about USD 145B in damage to property, while droughts caused about USD 14B in damage to crops over a period of 61 years.
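
The factor of about 10 can be checked directly from the totals in the table. The snippet below only restates that arithmetic; the variable names are illustrative.

```r
# Totals copied from the table above (USD)
flood_property <- 144657709807  # flood, PROPERTY.DAMAGE
drought_crop   <- 13972566000   # drought, CROP.DAMAGE

# Ratio of flood property damage to drought crop damage
ratio <- flood_property / drought_crop
round(ratio, 1)
```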

Events by Property Damage

head(arrange(data.gp, desc(PROPERTY.DAMAGE)))
Source: local data frame [6 x 6]

             EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES      N
              (chr)           (dbl)       (dbl)      (dbl)    (dbl)  (int)
1             flood    144657709807  5661968450        470     6789  25327
2 hurricane/typhoon     69305840000  2607872800         64     1275     88
3           tornado     56925660483   414953110       5633    91346  60652
4       storm surge     43323536000        5000         13       38    261
5       flash flood     16140861717  1421317100        978     1777  54278
6              hail     15727366777  3025537453         15     1361 288661

Looking at fatalities, though, tornadoes have been responsible for more loss of life than any other event, with over 5,600 deaths, followed by excessive heat, with a tally of almost 2,000 deaths.

Events by Fatalities

head(arrange(data.gp, desc(FATALITIES)))
Source: local data frame [6 x 6]

          EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES      N
           (chr)           (dbl)       (dbl)      (dbl)    (dbl)  (int)
1        tornado     56925660483   414953110       5633    91346  60652
2 excessive heat         7753700   492402000       1903     6525   1678
3    flash flood     16140861717  1421317100        978     1777  54278
4           heat         1797000   401461500        937     2100    767
5      lightning       928659283    12092090        816     5230  15755
6      tstm wind      4493058440   554007350        504     6957 219946

Events by Injuries

head(arrange(data.gp, desc(INJURIES)))
Source: local data frame [6 x 6]

          EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES      N
           (chr)           (dbl)       (dbl)      (dbl)    (dbl)  (int)
1        tornado     56925660483   414953110       5633    91346  60652
2      tstm wind      4493058440   554007350        504     6957 219946
3          flood    144657709807  5661968450        470     6789  25327
4 excessive heat         7753700   492402000       1903     6525   1678
5      lightning       928659283    12092090        816     5230  15755
6           heat         1797000   401461500        937     2100    767

These measures are merely summaries, totalling the impact of each event type over the measured period. We now turn to analyzing the impact on a per-event basis, to see whether there are event types that occur less often than others but are more damaging when they do occur.

plot_ly(data.gp, x = ~CROP.DAMAGE/N, y = ~PROPERTY.DAMAGE/N, text = ~paste("EventType: ", EVTYPE),
        type = "scatter", mode = "markers", color = ~FATALITIES/N) %>% layout( title="Crop vs Property Damage Per Event", hovermode="closest" )

Taking the average damage per occurrence for each event type, the picture changes. Excessive wetness, cold and wet conditions, and damaging freeze are identified as the most damaging events to crops, racking up about USD 142M, USD 66M and USD 37M in damage, respectively, each time they occur. For property damage, the combined category “tornadoes, tstm wind, hail” ranks as the most damaging, at USD 1.6B per occurrence on average.
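
To make the per-event calculation concrete, here is a minimal sketch on invented data. The data frame toy.gp and its values are hypothetical stand-ins for data.gp, and the example is written in base R so that it runs without extra packages.

```r
# Toy stand-in for data.gp; values are invented for illustration
toy.gp <- data.frame(
    EVTYPE          = c("a", "b", "c"),
    PROPERTY.DAMAGE = c(9e6, 4e6, 1e6),
    N               = c(900, 2, 10),
    stringsAsFactors = FALSE
)

# Average property damage per occurrence
toy.gp$PROPERTY.DAMAGE.TURN <- toy.gp$PROPERTY.DAMAGE / toy.gp$N

# Rank from highest to lowest per-occurrence damage
per.event <- toy.gp[order(-toy.gp$PROPERTY.DAMAGE.TURN), ]
per.event
```

In this toy case, type "a" has the largest total but the smallest average, while type "b", with only two occurrences, tops the per-event ranking. Rarely recorded event types can dominate such rankings, which is worth keeping in mind when reading the per-event tables.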

Events by Crop Damage

head(data.gp %>% arrange(desc(CROP.DAMAGE/N)) %>% mutate(CROP.DAMAGE.TURN=CROP.DAMAGE/N) %>% select(EVTYPE, CROP.DAMAGE.TURN))
Source: local data frame [6 x 2]

                   EVTYPE CROP.DAMAGE.TURN
                    (chr)            (dbl)
1       excessive wetness        142000000
2 cold and wet conditions         66000000
3         damaging freeze         37028750
4       hurricane/typhoon         29634918
5             river flood         29072017
6             early frost         21000000

Events by Property Damage

head(data.gp %>% arrange(desc(PROPERTY.DAMAGE/N)) %>% mutate(PROPERTY.DAMAGE.TURN=PROPERTY.DAMAGE/N) %>% select(EVTYPE, PROPERTY.DAMAGE.TURN))
Source: local data frame [6 x 2]

                      EVTYPE PROPERTY.DAMAGE.TURN
                       (chr)                (dbl)
1 tornadoes, tstm wind, hail           1600000000
2  heavy rain/severe weather           1250000000
3          hurricane/typhoon            787566364
4             hurricane opal            350316222
5                storm surge            165990559
6                 wild fires            156025000

Observing fatalities on a per-event basis, we get a different picture as well. The “tornadoes, tstm wind, hail” category has the highest death toll per occurrence, with an average of 25, followed by cold and snow with 14. Record/excessive heat ranks 4th with an average of about 6 deaths.

Events by Fatalities

head(data.gp %>% arrange(desc(FATALITIES/N)) %>% mutate(FATALITIES.TURN=FATALITIES/N) %>% select(EVTYPE, FATALITIES.TURN))
Source: local data frame [6 x 2]

                      EVTYPE FATALITIES.TURN
                       (chr)           (dbl)
1 tornadoes, tstm wind, hail       25.000000
2              cold and snow       14.000000
3      tropical storm gordon        8.000000
4      record/excessive heat        5.666667
5               extreme heat        4.363636
6          heat wave drought        4.000000

Finally, we can plot the number of injuries against fatalities to see which events have the highest impact on health.

plot_ly(data.gp, x = ~FATALITIES/N, y = ~INJURIES/N, text = ~paste("EventType: ", EVTYPE),
        type = "scatter", mode = "markers") %>% layout( title = "Injuries vs Fatalities per Event", hovermode="closest" )

Tropical storm Gordon caused on average 43 injuries per occurrence, followed by wild fires with roughly 38, and the “thunderstormw” label with 27. We can verify that in the table.

Events by Injuries

head(data.gp %>% arrange(desc(INJURIES/N)) %>% mutate(INJURIES.TURN=INJURIES/N) %>% select(EVTYPE, INJURIES.TURN))
Source: local data frame [6 x 2]

                 EVTYPE INJURIES.TURN
                  (chr)         (dbl)
1 tropical storm gordon          43.0
2            wild fires          37.5
3         thunderstormw          27.0
4    high wind and seas          20.0
5       snow/high winds          18.0
6       glaze/ice storm          15.0

Results

Looking at the whole period of 61 years, from 1950 to November 2011, the events with the highest damage to crops are droughts, with a total of about USD 14B, followed by floods with about USD 5.7B, and river floods and ice storms with about USD 5B each. The events with the highest damage to property are floods, with a total of about USD 145B, i.e. roughly 10 times the crop damage caused by droughts; in the ranking, floods are followed by hurricanes/typhoons and tornadoes, with damage totaling about USD 69B and USD 57B, and the latter lead in fatalities caused.

However, on a per-occurrence basis, the events with the highest economic impact on crops are excessive wetness, cold and wet conditions, and damaging freeze, causing between USD 37M and USD 142M in damage each time they occur, while the “tornadoes, tstm wind, hail” category causes the greatest damage to property, averaging USD 1.6B per occurrence. It is worth noting that tornadoes do not have the highest per-occurrence impact on health, despite their high damage to property; instead it is tropical storm Gordon and wild fires, with averages of 43 and 37.5 injuries per occurrence. Nonetheless, cyclic events such as floods seem to occur far more often than other events, and thus need careful attention as they amass greater total damage over time.

Interestingly, we found a distinction between the events that cause more damage to health and those that cause more damage to property and crops when looking at individual occurrences. While moving events such as tornadoes and hurricanes have a higher damage toll, it is events such as wild fires and heat that cause more injuries. However, over time, seasonal events like floods and tornadoes have higher injury and death tolls, just as they cause more damage to property and crops than any other events.