Synopsis

In this report we explore the NOAA Storm database between the years 1950 and 2011. The aim is to find out which weather events have caused greatest impact on public health and economy. We analyzed the NOAA storm database in two aspects: events with consequences for the public health and events with consequences for the economy. To investigate most harmful events to public health we totaled harms throughout the years and cut top 20% most harmful events. It came out that TORNADO is exceeding by large degree in terms of magnitude the other harmful event - EXCESSIVE HEAT, HIGH WIND and FLOOD. The same approach was taken with the investigation of the event types with greatest impact on economy. It came out that FLOOD, HURRICANE (TYPHOON) and TORNADO are the top 3 weather phenomena in that respect.

Data processing

library(dplyr)
library(readr)
library(stringdist)
library(ggplot2)
options(scipen = 20)

First we read the data from the repdatadataStormData.csv.bz2 file, downloaded from Peer-graded Assignment: Course Project 2 site.

StormData <- read_csv("repdatadataStormData.csv.bz2")

we check the first few rows (there are 902 297) rows in this dataset.

dim(StormData)
## [1] 902297     37
head(StormData[1:3, ])
## # A tibble: 3 x 37
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
##     <dbl>             <chr>    <chr>     <chr>  <dbl>      <chr> <chr>
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## # ... with 30 more variables: EVTYPE <chr>, BGN_RANGE <dbl>,
## #   BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## #   COUNTY_END <dbl>, COUNTYENDN <chr>, END_RANGE <dbl>, END_AZI <chr>,
## #   END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## #   FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## #   CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## #   ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## #   LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

The columns that we analyze are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. We transform PROPDMGEXP and CROPDMGEXP to factor variables as they denote a multiplier of PROPDMG and CROPDMG values.

DF_StormData <- select(StormData,
                       EVTYPE,
                       INJURIES,
                       FATALITIES,
                       PROPDMG,
                       PROPDMGEXP,
                       CROPDMG,
                       CROPDMGEXP) %>%
        mutate(EVTYPE = toupper((StormData$EVTYPE)),
               PROPDMGEXP = as.factor(StormData$PROPDMGEXP),
               CROPDMGEXP = as.factor(StormData$CROPDMGEXP))
head(DF_StormData)
## # A tibble: 6 x 7
##    EVTYPE INJURIES FATALITIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##     <chr>    <dbl>      <dbl>   <dbl>     <fctr>   <dbl>     <fctr>
## 1 TORNADO       15          0    25.0          K       0         NA
## 2 TORNADO        0          0     2.5          K       0         NA
## 3 TORNADO        2          0    25.0          K       0         NA
## 4 TORNADO        2          0     2.5          K       0         NA
## 5 TORNADO        2          0     2.5          K       0         NA
## 6 TORNADO        6          0     2.5          K       0         NA

Exponent processing PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP

In order to calculate correctly the amounts in PROPDMG and CROPDMG columns we have to take into account the values in PROPDMGEXP and CROPDMGEXP columns. They indicate whether we have to multiply the value by 100, 1 000, 1 000 0000 etc. For this purpose we define a function which sets the correct multiplier.

getExp <- function (multiplier){
        if (is.na(multiplier)){
                m <- 0
        }
        else if (any(multiplier == "H" | multiplier == "h")){
                m <- 100
        }
        else if (any(multiplier == "K" | multiplier == "k")){
                m <- 1000
        } 
        else if (any(multiplier == "M" | multiplier == "m")){
                m <- 1000000
        } 
        else if (any(multiplier == "B" | multiplier == "b")){
                m <- 1000000000
        }
        else{
                m <- 0
        }
   
        m
}

Event types processing

According to the NOAA documentation there are 48 official event types. We load these types and transform to uppercase.

EVT_OFFIC <- toupper(c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather"))

EVT_OFFIC
##  [1] "ASTRONOMICAL LOW TIDE"    "AVALANCHE"               
##  [3] "BLIZZARD"                 "COASTAL FLOOD"           
##  [5] "COLD/WIND CHILL"          "DEBRIS FLOW"             
##  [7] "DENSE FOG"                "DENSE SMOKE"             
##  [9] "DROUGHT"                  "DUST DEVIL"              
## [11] "DUST STORM"               "EXCESSIVE HEAT"          
## [13] "EXTREME COLD/WIND CHILL"  "FLASH FLOOD"             
## [15] "FLOOD"                    "FROST/FREEZE"            
## [17] "FUNNEL CLOUD"             "FREEZING FOG"            
## [19] "HAIL"                     "HEAT"                    
## [21] "HEAVY RAIN"               "HEAVY SNOW"              
## [23] "HIGH SURF"                "HIGH WIND"               
## [25] "HURRICANE (TYPHOON)"      "ICE STORM"               
## [27] "LAKE-EFFECT SNOW"         "LAKESHORE FLOOD"         
## [29] "LIGHTNING"                "MARINE HAIL"             
## [31] "MARINE HIGH WIND"         "MARINE STRONG WIND"      
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"             
## [35] "SEICHE"                   "SLEET"                   
## [37] "STORM SURGE/TIDE"         "STRONG WIND"             
## [39] "THUNDERSTORM WIND"        "TORNADO"                 
## [41] "TROPICAL DEPRESSION"      "TROPICAL STORM"          
## [43] "TSUNAMI"                  "VOLCANIC ASH"            
## [45] "WATERSPOUT"               "WILDFIRE"                
## [47] "WINTER STORM"             "WINTER WEATHER"

In our database we check and find out that there are 890 unique present event types after we transform to upper case.

EVT_CURRENT <- unique(toupper(StormData$EVTYPE))
EVT_CURRENT[1:8]
## [1] "TORNADO"               "TSTM WIND"             "HAIL"                 
## [4] "FREEZING RAIN"         "SNOW"                  "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE"              "WINTER STORM"

In order to correctly match the official and present event types we create a distribution matrix, which analyzes how close is given present type to official one.

df_stringdist <- stringdistmatrix(EVT_CURRENT, EVT_OFFIC)
df_stringdist[1:3, ]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]   16    8    7   10   13    9    8   10    6     9     9    14    20
## [2,]   15    9    8    9   11   10    9   10    9     8     8    12    16
## [3,]   19    8    7   11   13    9    9   11    7     8    10    13    21
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    10     6    10    10    10     6     6     9     9     9     9
## [2,]     9     8    10    10    11     8     9     8     9     8     4
## [3,]     9     5    12    11    11     0     3     7     8     8     7
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,]    16     8    15    12     8     8    14    15    20    10     7
## [2,]    17     9    15    12     7    10    11    11    15    10     8
## [3,]    17     9    15    13     7     7    14    16    22    11     5
##      [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,]     7    12     8    14     0    14    10     4     9     8     8
## [2,]     8    12     5     8     8    16    12     6    11     9     8
## [3,]     5    15    10    15     6    17    12     6    10     9     7
##      [,47] [,48]
## [1,]     9    11
## [2,]    10    11
## [3,]    12    13

The lesser the number, the closer is a present event type to an official one. The number denotes the number of edits it takes for the two strings to become the same.

After that we create Dictionary data frame, which pairs types that are closest to each other.

makeDictionary <- function(matrix, official, present){
        df <- data.frame(Current = as.character(), 
                         Official = as.character(),
                         Distance = as.numeric(),
                         stringsAsFactors = FALSE)
        
        for(i in 1:length(present)){
                index <- match(min(matrix[i, ]), matrix[i, ])
                newEntry <- data.frame(Current = present[i], 
                                       Official = official[index],
                                       Distance = min(matrix[i, ]))
                df <- rbind(df, newEntry)
        }
        df
}

df_dictionary <- makeDictionary(df_stringdist, EVT_OFFIC, EVT_CURRENT)
head(df_dictionary)
##                 Current     Official Distance
## 1               TORNADO      TORNADO        0
## 2             TSTM WIND    HIGH WIND        4
## 3                  HAIL         HAIL        0
## 4         FREEZING RAIN FREEZING FOG        4
## 5                  SNOW        FLOOD        4
## 6 ICE STORM/FLASH FLOOD  FLASH FLOOD       10

We then construct a function to return the closest official event type for each present event type according to the dictionary.

getOfficialEvent <- function(curr_event, dict){
        offic_event <- dict[dict$Current == curr_event, 2]
        offic_event
}

Final proccessing

After we have correct matching dictionary and multiplier, we finalize the data frame and calculate the totals. We take the totals as an indicator for harm and impact on economy.

DF_StormData_Totals <- select(DF_StormData, 
                                EVTYPE, 
                                PROPDMG, 
                                PROPDMGEXP, 
                                CROPDMG, 
                                CROPDMGEXP,
                                INJURIES,
                                FATALITIES) %>%
        mutate(PROPDMGEXP = sapply(PROPDMGEXP, getExp),
               CROPDMGEXP = sapply(CROPDMGEXP, getExp),
               EVTYPE = sapply(EVTYPE, getOfficialEvent, df_dictionary)) %>%
        transmute(INJURIES,
                  FATALITIES,
                  EVTYPE = as.factor(EVTYPE),
                  PROPDMG = PROPDMG * PROPDMGEXP,
                  CROPDMG = CROPDMG * CROPDMGEXP) %>%
        mutate(TOTECONDMG = PROPDMG + CROPDMG,
               TOTHARMS = INJURIES + FATALITIES)

Results

Public health harm

DF_StormData_SumHarms <- group_by(DF_StormData_Totals,
                                    EVTYPE) %>%
        summarise(TOTHARMS = sum(TOTHARMS)) %>%
        filter(TOTHARMS > quantile(TOTHARMS, 0.8)) %>%
         arrange(desc(TOTHARMS))

g <- ggplot(DF_StormData_SumHarms, aes(x = EVTYPE, y = TOTHARMS/10^3, fill = EVTYPE))
g <- g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=30,hjust=1,vjust=1))
g + labs(title = "Total health harm between the years 1950 and 2011") + labs(x = "") + labs(y = "$ Thousands") + scale_y_continuous(labels=function(y) format(y, big.mark = " "))

In conclusion the TOP 3 weather phenomena with consequences for public health:

DF_StormData_SumHarms[1:3, ]
## # A tibble: 3 x 2
##           EVTYPE TOTHARMS
##           <fctr>    <dbl>
## 1        TORNADO    96997
## 2      HIGH WIND     9286
## 3 EXCESSIVE HEAT     8723

Impact on economy

DF_StormData_SumEconDmg <- group_by(DF_StormData_Totals,
                                        EVTYPE) %>%
        summarise(TOTECONDMG = sum(TOTECONDMG)) %>%
        filter(TOTECONDMG > quantile(TOTECONDMG, 0.8)) %>%
                       arrange(desc(TOTECONDMG))

g <- ggplot(DF_StormData_SumEconDmg, aes(x = EVTYPE, y = TOTECONDMG/10^6, fill = EVTYPE))
g <- g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=30,hjust=1,vjust=1))
g + labs(title = "Total US dollars between the years 1950 and 2011") + labs(x = "") + labs(y = "$ Millions") + scale_y_continuous(labels=function(y) format(y, big.mark = " "))

In conclusion the TOP 3 weather phenomena with consequences for economy:

DF_StormData_SumEconDmg[1:3, ]
## # A tibble: 3 x 2
##                EVTYPE   TOTECONDMG
##                <fctr>        <dbl>
## 1               FLOOD 151264401300
## 2 HURRICANE (TYPHOON)  75501243800
## 3             TORNADO  57356997190